Abstract

We give a robust version of the celebrated result of Kruskal on the uniqueness of tensor
decompositions: we prove that given a tensor whose decomposition satisfies a robust form of
Kruskal's rank condition, it is possible to approximately recover the decomposition if the tensor
is known up to a sufficiently small (inverse polynomial) error.

Kruskal's theorem has found many applications in proving the identifiability of parameters
for various latent variable models and mixture models such as Hidden Markov models, topic
models etc. Our robust version immediately implies identifiability using only polynomially
many samples in many of these settings. This polynomial identifiability is an essential first step
towards efficient learning algorithms for these models.

Recently, algorithms based on tensor decompositions have been used to estimate the parameters
of various hidden variable models efficiently in special cases as long as they satisfy certain
"non-degeneracy" properties. Our methods give a way to go beyond this non-degeneracy barrier,
and establish polynomial identifiability of the parameters under much milder conditions.
Given the importance of Kruskal's theorem in the tensor literature, we expect that this robust
version will have several applications beyond the settings we explore in this work.
1 Introduction

Statisticians have long studied the identifiability of probabilistic models [Tei61, Tei67, TC82], i.e.
whether the parameters of a model can be learned from data generated by the model. A central
question in unsupervised learning [Gha04] is the efficient computation of such latent model
parameters from observed data. A necessary step towards efficient (polynomial time) learning is
to show that the parameters are indeed identifiable after observing polynomially many samples.
The method of moments approach, pioneered by Pearson [Pea94], infers model parameters from
empirical moments such as means, pairwise and other higher order correlations. In general, very
high order moments may be needed for this approach to succeed and the unreliability of empirical
estimates of these moments leads to exponential sample complexity [MV10, BS10, GLPR12].

An exciting sequence of recent work [MR06, AHK12, HK12, AGH+12] has met with considerable
success in cases where the underlying models satisfy a certain non-degeneracy condition (that
we will explain later). Informally, the condition requires that the dimension ($n$) of the observations
is at least as large as the number of possible values ($R$) for the hidden variable and that certain
model parameters are in general position. The moments are naturally represented by tensors (high
dimensional analogs of matrices) and low rank decompositions of such tensors can be used to deduce
the parameters of the underlying model. Under suitable non-degeneracy assumptions, the required
tensor decompositions can be computed efficiently using an iterative procedure akin to power
iteration for computing matrix eigenvalues. One focus of our work is developing tensor decomposition
techniques that apply in more general settings where these non-degeneracy assumptions are
violated, i.e. $n$ is much smaller than $R$. Such settings do arise in many cases of practical interest
such as in applications of hidden Markov models to speech recognition and image classification,
where the dimension ($n$) of the feature space is typically much smaller than the number of values
($R$) for the hidden variable. For instance, the (effective) feature space corresponds to just the
low-frequency components in the Fourier spectrum in speech, or the local neighborhood of a pixel
in images. These spaces typically have much lower dimension than the number of words or image classes.

In fact, the connection of tensor decompositions to learning probabilistic models has been made 
earlier in the algebraic statistics literature. In a series of papers, identifiability of several latent
variable models was established [AMR09, APRS11, RS12] via low rank decomposition of certain
moment tensors. A fundamental result of Kruskal [Kru77] on uniqueness of tensor decompositions
plays a crucial role in ensuring that the model parameters are correctly identified by this procedure. 
Note that this assumes access to an infinite number of samples and does not give any information 
on the number of samples needed to learn the model parameters within specified error bounds. 
Kruskal's theorem by itself is not useful for establishing any such sample complexity bounds since 
it only guarantees uniqueness for low rank decompositions of the actual moment tensors. It does 
not say anything about the decomposition of empirical moment tensors which are approximations of 
these. In order to understand how large a sample size is needed, one would need a robust uniqueness 
guarantee of this form: if the empirical moment tensor T' is close to the moment tensor T, then a 
low rank decomposition of T' is (term by term) close to a low rank decomposition of T. 

Our main technical contribution in this work is establishing such a robust version of Kruskal's 
classic uniqueness theorem for tensor decompositions. This provides a uniqueness guarantee that 
is directly applicable for establishing polynomial identifiability in a host of applications [AMR09]
where Kruskal's theorem was used to prove identifiability assuming access to exact moment tensors.
Since polynomially many samples from the distribution (typically) yield an approximation to these
tensors up to $1/\mathrm{poly}(n)$ error, our robust version of Kruskal's theorem establishes polynomial
identifiability in all such applications. To the best of our knowledge, no such robust version of 
Kruskal's theorem is known in the literature. Given the importance of this theorem in the tensor 
literature, we expect that this robust version will have applications beyond the settings we explore 
in this work. Our robust uniqueness theorem is accompanied by new algorithms to find low rank 
tensor decompositions. 

1.1 Tensors and their Decompositions 

A tensor is a multidimensional array - a generalization of vectors and matrices, e.g. an $n_1 \times n_2 \times n_3$
tensor is a 3-tensor, which is an element of $\mathbb{R}^{n_1 \times n_2 \times n_3}$. Low rank tensor decompositions (analogs of
SVD for matrices) have been studied intensively as methods for extracting structure in data. These
originated in work of Hitchcock [Hit27] and Cattell [Cat44]. They were studied in the 60's and 70's
in the psychometrics literature and since the 80's, in the chemometrics literature. The notion of 
tensor rank also plays an important role in algebraic complexity, and is closely connected to the 
exponent of matrix multiplication. More recently, tensor decompositions have found applications 
in signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, 
graph analysis, neuroscience and more. 

Carroll and Chang [CC70] introduced CANDECOMP (canonical decomposition) and, independently,
Harshman [Har70] introduced PARAFAC (parallel factors). CANDECOMP/PARAFAC is
now referred to as CP decomposition [Kie00]. It expresses a tensor as a sum of rank-one tensors
where each rank-one tensor is the outer product of column vectors. The rank of a tensor is the
minimum number of terms required for such a decomposition. While the definition of tensor rank
is analogous to that of matrix rank, their properties are quite different. Computing the
rank of a tensor is NP-hard [Has90], and several other problems associated with low rank
approximation of tensors are NP-hard as well [HL13].

For matrices, a fundamental result of Eckart and Young [EY36] shows that the best rank-$k$
approximation consists of the leading $k$ terms of the SVD. This is not the case for CP decomposition of
tensors - the best rank one approximation may not be a factor in the best rank two approximation.
In fact, the best rank-$k$ approximation may not exist. For example, certain tensors of rank three can
be arbitrarily well approximated by a sequence of rank-two tensors [Knu, Paa00, DSL08, Lan12].
In fact, the set of tensors of a certain size that do not have a best rank-$k$ approximation has positive
volume [DSL08]. To overcome this problem, the concept of border rank was introduced and studied
in the algebraic complexity community. This is defined to be the minimum number of rank-one
tensors that are sufficient to approximate the given tensor with arbitrarily small error. In fact, the
complexity of matrix multiplication is exactly captured by the border rank of the associated tensor
[KB09, Lan12].

An important property of higher order tensors is that (under certain conditions) their minimum 
rank decompositions are unique up to trivial scaling and permutation. This is in contrast to matrix
decompositions. Note that the SVD of a matrix is unique (assuming distinct singular values) only 
because we impose additional orthogonality constraints. 
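
To make the contrast with matrices concrete, here is a minimal sketch (the matrices `A`, `B`, `Q` and the helper `max_abs_cosine` are illustrative choices, not taken from the paper). A rank-2 factorization $M = AB^T$ can be rewritten as $(AQ)(BQ^{-T})^T$ for any invertible $Q$, and the new factors are not merely rescaled or permuted copies of the old ones.

```python
import numpy as np

# Two explicit factor matrices with R = 2 columns each (illustrative values).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])  # 4 x 2
B = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]])               # 3 x 2
M = A @ B.T                                                      # 4 x 3, rank 2

# Any invertible 2 x 2 matrix gives another factorization of the same M.
Q = np.array([[2.0, 1.0], [1.0, 1.0]])
A2 = A @ Q
B2 = B @ np.linalg.inv(Q).T

same_product = np.allclose(M, A2 @ B2.T)


def max_abs_cosine(X, Y):
    """Largest |cosine| between a column of X and a column of Y."""
    Xn = X / np.linalg.norm(X, axis=0)
    Yn = Y / np.linalg.norm(Y, axis=0)
    return float(np.max(np.abs(Xn.T @ Yn)))


# A value strictly below 1 means no column of A2 is a scaled copy of a column of A.
mixing = max_abs_cosine(A, A2)
```

Here `same_product` confirms that both factor pairs give the same matrix, while `mixing` stays bounded away from 1, so the second factorization is genuinely different.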

A classic result of Kruskal [Kru77] gives a sufficient condition for uniqueness of the CP
decomposition of a 3-tensor. Suppose that a 3-tensor $T$ has the following decomposition:

$$T = [A\ B\ C] = \sum_{r=1}^{R} A_r \otimes B_r \otimes C_r \qquad (1)$$

Let the Kruskal rank or K-rank $k_A$ of matrix $A$ (formed by column vectors $A_r$) be the maximum
value of $k$ such that any $k$ columns of $A$ are linearly independent; $k_B$ and $k_C$ are similarly defined.
Kruskal's result says that a sufficient condition for the uniqueness of the decomposition (1) is

$$k_A + k_B + k_C \ge 2R + 2 \qquad (2)$$
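
As an illustration of condition (2), the following sketch computes the exact Kruskal rank of small integer matrices by brute force and checks the inequality. The function names and the example matrix `G` are hypothetical. The enumeration over column subsets is exponential in $R$ and is meant only for tiny examples.

```python
from fractions import Fraction
from itertools import combinations


def rank(vectors):
    """Exact rank of a list of vectors, by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    width = len(rows[0]) if rows else 0
    for c in range(width):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def kruskal_rank(columns):
    """Largest k such that every k of the given columns are linearly independent."""
    k = 0
    for size in range(1, len(columns) + 1):
        if all(rank(list(s)) == size for s in combinations(columns, size)):
            k = size
        else:
            break
    return k


def kruskal_condition(A, B, C):
    """Check k_A + k_B + k_C >= 2R + 2 for factor matrices given as lists of R columns."""
    R = len(A)
    return kruskal_rank(A) + kruskal_rank(B) + kruskal_rank(C) >= 2 * R + 2


# Five columns in dimension 4, in general position: K-rank 4 although R = 5 > n = 4.
G = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1)]
```

With $A = B = C$ equal to `G`, the sum of K-ranks is 12 and $2R + 2 = 12$, so the condition holds even though the rank exceeds the dimension.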

Several alternate proofs of this fundamental result have been given [tBS02, JS04, SS07, Rho10,
Lan12]. Sidiropoulos and Bro [SB00] extended this result to $\ell$-order tensors. Let $T$ be an $\ell$-order
tensor with decomposition

$$T = \sum_{r=1}^{R} \bigotimes_{j=1}^{\ell} U^{(j)}_r$$

Then the decomposition is unique if

$$\sum_{j=1}^{\ell} k_{U^{(j)}} \ge 2R + (\ell - 1) \qquad (3)$$

We give a robust version of Kruskal's uniqueness theorem for decomposition of 3-tensors. To
this end, we need a natural robust analogue of Kruskal rank: we say that K-rank$_\tau(A) \ge k$ if every
submatrix of $A$ formed by $k$ of its columns has minimum singular value at least $1/\tau$. A matrix is
called bounded if its column vectors have bounded length. Finally, we measure closeness between
two tensors or two matrices by the Frobenius norm of their difference. Please see Section 2 for
precise definitions.

Our first result shows that any tensor with bounded decomposition that satisfies the robust 
Kruskal condition has a unique decomposition up to small error (formal statement in Section 2):

Informal Theorem. If any order 3 tensor $T$ has a bounded rank $R$ decomposition $[A\ B\ C]$,
where the robust K-rank$_\tau$ values $k_A, k_B, k_C$ satisfy $k_A + k_B + k_C \ge 2R + 2$, then any decomposition
$[A'\ B'\ C']$ that is $\epsilon$-close to $T$ has $A', B', C'$ being individually $\epsilon'$-close to $A, B$ and $C$ respectively
when $\epsilon < \epsilon' / \mathrm{poly}(R, n, \tau)$.



A similar theorem (see Theorem 2.7) also holds for higher order tensors and the analogous
robust Kruskal rank condition is exactly (3), where $k_{U^{(j)}}$ corresponds to the robust Kruskal rank
of $U^{(j)}$. Note that when all the $U^{(j)}$ have the same rank, the robust Kruskal condition becomes
weaker for higher order tensors.

Why is it non-trivial to obtain a robust version from existing proofs? Kruskal's theorem 
gives conditions under which the components of a tensor decomposition can be identified uniquely. 
However the proofs that we are aware of strongly use inductive lemmas which prove that subsets of 
the components of one decomposition have to necessarily belong in any other potential decomposition,
and use them to conclude that any two decompositions are in fact the same. When working
with representations that are only nearly equal, these inductive arguments typically accumulate
errors in each step, thereby requiring the initial error to be exponentially small in order to reach 
the desired conclusion. Such a result would not be of any value for establishing polynomial sample 
complexity bounds, since the sample size would need to be exponentially large for the empirical 
moment tensors to approximate the true moment tensor within such a low error. We overcome this 
issue by using arguments that are purely combinatorial whenever possible, and carefully avoiding 
a loss at each step. 

Since finding low-rank decomposition of tensors is of great practical interest, it is natural to 
study algorithms for this problem. While this and many related problems are NP-hard in general
[HL13], we give an algorithm which, given an approximation to a tensor, finds an approximate
low-rank decomposition in time exponential only in the rank (and not the dimensions of the tensor).

Informal Theorem. Given a tensor with a bounded, rank $R$ decomposition up to an error $\epsilon$, we
can find a rank $R$ approximation with error $O(\epsilon)$ in time $\exp(R^2 \log(n/\epsilon))\,\mathrm{poly}(n)$.

This can be viewed as a tensor analog of low-rank approximation, which is very well-studied for 
matrices. Note that our algorithm does not require the promised decomposition to have additional
well-conditioned properties. If we additionally have such guarantees (e.g., that the sum of the
K-ranks of the components is high), then Theorem 2.7 implies that the algorithm finds this particular
decomposition (up to a small error).

1.2 Latent Variable Models 

We now describe some of the latent variable models that our results are applicable to. We will 
formally state the identifiability and algorithmic results we obtain for each of these in Section 5.

Consider a simple mixture model, where each sample is generated from a mixture of $R$
distributions $\{\mathcal{D}_r\}_{r \in [R]}$, with mixing probabilities $\{w_r\}_{r \in [R]}$. Here the latent variable $h$ corresponds to
the choice of distribution and it can have $R$ possibilities. First the distribution $h = r$ is picked
with probability $w_r$, and then the data is sampled according to $\mathcal{D}_r$, which has mean $\mu_r \in \mathbb{R}^n$. Let
$M_{n \times R}$ represent the matrix of these $R$ means. This setting captures many latent variable models
including topic models, Hidden Markov Models (HMMs), Gaussian mixtures etc.

Multi-view Mixture Model 

Multi-view models are mixture models with a discrete latent variable $h \in [R]$, such that $\Pr[h = r] = w_r$.
We are given multiple observations or views $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)}$ that are conditionally
independent given the latent variable $h$, with $\mathbb{E}[x^{(j)} \mid h = r] = \mu_r^{(j)}$. Let $M^{(j)}$ be the $n \times R$ matrix whose
columns are the means $\{\mu_r^{(j)}\}_{r \in [R]}$. The goal is to learn the matrices $\{M^{(j)}\}_{j \in [\ell]}$ and the mixing
weights $\{w_r\}_{r \in [R]}$.

Multi-view models are very expressive, and capture many well-studied models like Topic Models
[AHK12], Hidden Markov Models (HMMs) [MR06, AMR09, AHK12], random graph mixtures
[AMR09], and the techniques developed for this class have also been applied to phylogenetic tree
models [Cha96, MR06] and certain tree mixtures [AHHK12].



Exchangeable (single) Topic Model 

The simplest latent variable model that fits the multi-view setting is the Exchangeable Single Topic 
model as given in [AHK12]. This is a simple bag-of-words model for documents, in which the words
in a document are assumed to be exchangeable. This model can be viewed as first picking the topic
$r \in [R]$ of the document, with probability $w_r$. Given a topic $r \in [R]$, each word in the document
is sampled independently at random according to the probability distribution $\mu_r \in \mathbb{R}^n$ ($n$ is the
dictionary size). In other words, the topic $r \in [R]$ is a latent variable such that the $\ell$ words in a
document are conditionally i.i.d. given $r$.

The views in this case correspond to the words in a document. This is a special case of the
multi-view model since the distribution of each of the views $j \in [\ell]$ is identical.
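
The link between this model and tensors can be sketched as follows, under the assumption (standard in the moment-based literature, but not spelled out above) that each word is encoded as a standard basis vector of $\mathbb{R}^n$. Then the expected outer product of three words of a document is $\sum_r w_r\, \mu_r \otimes \mu_r \otimes \mu_r$, a tensor of rank at most $R$. The parameter values and function names below are illustrative only.

```python
import numpy as np


def population_third_moment(w, M):
    """sum_r w[r] * M[:, r] (x) M[:, r] (x) M[:, r] for an n x R matrix M of word distributions."""
    return np.einsum('r,ir,jr,kr->ijk', w, M, M, M)


def empirical_third_moment(w, M, num_docs, rng):
    """Estimate the same tensor from sampled documents of three words each."""
    n, R = M.shape
    topics = rng.choice(R, size=num_docs, p=w)          # latent topic of each document
    words = np.empty((num_docs, 3), dtype=int)
    for r in range(R):
        idx = np.flatnonzero(topics == r)
        # Given the topic, the three words are i.i.d. draws from column r of M.
        words[idx] = rng.choice(n, size=(idx.size, 3), p=M[:, r])
    T = np.zeros((n, n, n))
    np.add.at(T, (words[:, 0], words[:, 1], words[:, 2]), 1.0)
    return T / num_docs


w = np.array([0.4, 0.6])                                 # mixing weights, R = 2
M = np.array([[0.7, 0.1],
              [0.1, 0.2],
              [0.1, 0.3],
              [0.1, 0.4]])                               # n = 4 words, columns sum to 1

rng = np.random.default_rng(0)
T_pop = population_third_moment(w, M)
T_emp = empirical_third_moment(w, M, num_docs=50_000, rng=rng)
err = np.linalg.norm(T_pop - T_emp)                      # Frobenius distance
```

The Frobenius distance `err` shrinks as the number of documents grows, which is exactly the regime in which a robust uniqueness guarantee is needed: only an approximation of the moment tensor is ever observed.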

Hidden Markov Models 

Hidden Markov Models (HMMs) are extensively used in speech recognition, image classification,
bioinformatics etc. [Edd96, GY08]. We follow the same setting as in [AMR09]: there is a hidden
state sequence $Z_1, Z_2, \ldots, Z_m$ taking values in $[R]$, that forms a stationary Markov chain
$Z_1 \to Z_2 \to \cdots \to Z_m$ with transition matrix $P$ and initial distribution $w = \{w_r\}_{r \in [R]}$ (assumed to be
the stationary distribution). The observation $X_t$ is represented by a vector $x^{(t)} \in \mathbb{R}^n$. Given the
state $Z_t$ at time $t$, $X_t$ (and hence $x^{(t)}$) is conditionally independent of all other observations and
states. The matrix $M$ (of size $n \times R$) represents the probability distribution for the observations:
the $r$th column $M_r$ represents the probability distribution conditioned on the state $Z_t = r$, i.e.

$$\forall r \in [R],\ t \in [m],\ i \in [n], \quad \Pr[X_t = i \mid Z_t = r] = M_{ir}.$$
As mentioned previously, in many important applications of HMMs, $n$ is much smaller than $R$.
For example, in image classification, the commonly used SIFT features [Low99] are 128 dimensional, while
the number of image classes is much larger, e.g. 256 classes in the Caltech-256 dataset [GHP07]
and several thousands in the case of ImageNet [DDS+09]. Similarly, in speech recognition, the
features of an audio signal are typically based on mel-frequency cepstral coefficients (MFCCs) or
an encoding called perceptual linear prediction (PLP) that incorporates psychoacoustic constraints
[GY08]; e.g. these are used to obtain a 39 dimensional feature vector in the popular HTK toolkit
for building HMMs for speech recognition [YEG+02, WGPY97]. On the other hand, the number
of states in these HMMs is much larger. Further, in some other applications, even when the feature
vectors lie in a large dimensional space ($n \gg R$), the set of relevant features or the effective feature
space could be a space of much smaller dimension ($k < R$), that is unknown to us.

Mixtures of Spherical Gaussians 

Learning mixtures of Gaussians has a long and rich history - our overview is necessarily brief and
focuses on work relevant to our results. We consider the setting where we have a mixture of $R$
spherical Gaussians in $\mathbb{R}^n$, with mixing weights $w_1, w_2, \ldots, w_R$, means $\mu_1, \mu_2, \ldots, \mu_R$, and the
common variance $\sigma^2$. Much work on this problem needs certain separation guarantees between the
centers [Das99, AK01, VW04, AM05, DS07, KK10, AS12]. Recently, moment methods were
developed for arbitrary Gaussians [KMV10, MV10, BS10], albeit with sample complexity and running
time exponential in $R$ - such dependence is necessary in general. Recent work [AGH+12, HK13]
developed efficient algorithms for special cases of this problem without needing any separation
assumptions. These methods, based on tensor decompositions, need the condition that the means
are linearly independent and hence necessarily $n \ge R$ (additionally, the matrix of means needs to
be well conditioned, i.e. the means should not be close to a low dimensional subspace).

We apply our results on tensor decompositions to many of the latent variable models described
above. Here is a representative result for multi-view models that applies when the dimension of
the observations ($n$) is $\delta R$, where $\delta$ is a small positive constant and $R$ is the size of the range of the
hidden variable, and hence the rank of the associated tensors. In order to establish this, we apply
our robust uniqueness result to the $\ell$th moment tensor for $\ell = \lceil 2/\delta \rceil + 1$.

Informal Theorem. For a multi-view model with $R$ topics or distributions, such that each of the
parameter matrices $M^{(j)}$ has robust K-rank of at least $\delta R$ for some constant $\delta$, we can learn these
parameters up to error $\epsilon$ with high probability using $\mathrm{poly}_{\ell}(n, R)$ samples. Further, these parameters
can be approximately computed in time $\exp_{\ell}(R^2 \log(n/\epsilon))\,\mathrm{poly}(n)$.

Polynomial identifiability was not known previously for these models in the settings that we 
consider. Moreover, except for the well studied setting of mixtures of Gaussians, no provably good 
algorithms were known (even with running time $\exp(\mathrm{poly}(R))$).

For mixtures of Gaussians, our results shed more light on polynomial identifiability: the
algorithm of [AGH+12, HK13] shows how to identify mixtures of (spherical) Gaussians efficiently when
we have $R$ Gaussians in $d$ dimensions, when the means satisfy certain well-conditioned properties
(which in particular requires $d \ge R$). When $d = 2$, Moitra and Valiant [MV10] rule out polynomial
identifiability by giving two distributions for which we require exponentially many samples to
distinguish one from the other. Thus it is natural to ask what happens in between, when $d < R$,
but is not too small. Our results imply that a mixture of $R$ Gaussians of known variance in a $\delta R$
dimensional space (any $\delta > 0$) can be identified with polynomially many samples.

1.3 Overview of Techniques 

Robust Uniqueness of Tensor decompositions. The main technical contribution of our paper 
is the Robust Uniqueness theorem for Tensor decompositions. Our proof broadly follows the outline 
of Kruskal's original proof [Kru77]: it proceeds by establishing a certain Permutation lemma, which
gives sufficient conditions to conclude that the columns of two matrices are permutations of each
other (up to scaling). Given two decompositions $[A\ B\ C]$ and $[A'\ B'\ C']$ for the same tensor, it
is shown that A, A' satisfy the conditions of the lemma, and thus are permutations of each other. 
Finally, it is shown that the three permutations for A, B and C (respectively) are identical. To 
prove the robust uniqueness theorem, the key ingredient is a robust version of the permutation 
lemma. 

The first step in our argument is to prove that if $A, B, C$ are "well-conditioned" (i.e., satisfy the
K-rank conditions of the theorem), then any other "bounded" decomposition which is $\epsilon$-close is also
well-conditioned. This step is crucial to our argument, while an analogous step was not explicitly
needed for the proofs of exact uniqueness theorems.¹ Besides, this statement is interesting in its
own right: it implies, for instance, that there cannot be a smaller rank (bounded) decomposition.

The second and most technical step is to prove the robust permutation lemma. The (robust)
Permutation lemma needs to establish that for every column of $C'$, there is some column of $C$ close
to it. Kruskal's proof [Kru77] roughly uses downward induction to establish the following claim:
for every set of $i \le$ K-rank columns of $C'$, there are at least as many columns of $C$ that are in
the span of the chosen vectors. The downward induction infers this by considering intersections of
columns close to $i + 1$ dimensional spaces.

¹ Note that the uniqueness theorem, in hindsight, establishes that the other decomposition is also well-conditioned.

The natural analogue of this approach would be to consider columns of $C$ which are $\epsilon$-close to the
spans of subsets of columns of $C'$. However, the inductive step involves considering combinations
and intersections of the different spans that arise, and such arguments do not seem very tolerant to
noise. In particular, we lose a factor of $\tau n$ in each iteration, i.e., if the statement was true for $i + 1$
with error $\epsilon_{i+1}$, it will be true for $i$ with error $\epsilon_i = \tau n \cdot \epsilon_{i+1}$. Since $k$ steps of downward induction
need to be unrolled, we recover a robust permutation lemma only when the error is at most $1/(\tau n)^k$ to
start with, which is exponentially small since $k$ is typically $\Theta(n)$.
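
The blow-up can be made explicit by unrolling the recursion. In the sketch below, the indexing (input error $\epsilon_k$, final error $\epsilon_0$) and the target $\epsilon_0 \le 1$ are assumptions chosen for illustration; any fixed target gives the same exponential behaviour.

```latex
\epsilon_i = (\tau n)\,\epsilon_{i+1}
\quad\Longrightarrow\quad
\epsilon_0 = (\tau n)^{k}\,\epsilon_k ,
\qquad\text{so}\qquad
\epsilon_0 \le 1 \iff \epsilon_k \le (\tau n)^{-k}.
% Example: \tau = 1,\ n = 100,\ k = 50 \ \text{gives}\ \epsilon_k \le 100^{-50} = 10^{-100}.
```

An input accuracy of this order would require a number of samples exponential in $k$, which is why a loss-free induction is needed.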

We overcome this issue by showing a different, more tricky inductive statement, whereby we do
not lose any error in the recursion. This is described in Section 3.3. To carry forth this argument
we crucially rely on the fact that $C'$ is also "well-conditioned" and other observations.

Algorithms for low-rank tensor decompositions. At a high level, our algorithm for finding a
rank $R$ approximation proceeds by finding a small ($O(R)$) dimensional space and then exhaustively
searching, which takes time $\exp(R^2 \log n)\,\mathrm{poly}(n)$. Note that a naive exhaustive search using an
$\epsilon$-net in the entire $n$ dimensional space would incur a run time of $\exp(Rn)\,\mathrm{poly}(n)$, which is much
worse if $n \gg R$.

Suppose the best rank $R$ approximation to an input tensor has error $\epsilon$. We first find an
$R$-dimensional space for each of the (three) dimensions, so that there is an $O(\epsilon)$-close rank $R$
decomposition that comprises vectors only from the corresponding $R$-dimensional spaces. We note
that the spaces we find need not correspond to the span of the components in the optimum
decomposition, but they suffice to obtain an $O(\epsilon)$ approximation. Another feature of the algorithm
is that it does not assume that the tensor has an approximate "well conditioned" decomposition, 
and assumes only boundedness. 
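
The overview above does not say how the $R$-dimensional spaces are found. One natural candidate, shown here purely as an assumption of this sketch and not as the paper's procedure, is to take the top $R$ left singular vectors of each mode unfolding of the tensor. For an exactly rank-$R$ tensor, projecting every mode onto these subspaces leaves the tensor unchanged. All names below are hypothetical.

```python
import numpy as np


def mode_subspace(T, mode, R):
    """Orthonormal basis (as columns) of the top-R left singular subspace of a mode unfolding."""
    unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
    return U[:, :R]


def project_onto_subspaces(T, R):
    """Project each of the three modes of T onto its R-dimensional subspace."""
    projectors = []
    for mode in range(3):
        U = mode_subspace(T, mode, R)
        projectors.append(U @ U.T)
    P0, P1, P2 = projectors
    return np.einsum('ia,jb,kc,abc->ijk', P0, P1, P2, T)


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((4, 2))
T = np.einsum('ir,jr,kr->ijk', A, B, C)        # exact rank-2 tensor of size 6 x 5 x 4

T_proj = project_onto_subspaces(T, R=2)
err_exact = np.linalg.norm(T - T_proj)         # numerically zero for an exact rank-2 tensor
```

After such a reduction, a search for the factor vectors only has to range over $R$-dimensional spaces rather than $n$-dimensional ones, which is the source of the $\exp(R^2 \log n)$ versus $\exp(Rn)$ gap mentioned above.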

1.4 Related Work 

While our applications to learning latent variable models are inspired by the works of [AHK12,
AGH+12], our results are significantly different, particularly from a tensor decomposition
perspective. Anandkumar et al. [AGH+12] give algorithms for tensors which have a symmetric orthogonal
decomposition, i.e. a decomposition of the form $\sum_{r=1}^{R} A_r \otimes A_r \otimes A_r$ where the vectors $A_r$ are
orthogonal. In general, a rank-$R$ tensor may not have any orthogonal decomposition. Note that any
tensor in $n \times n \times n$ dimensions which has rank $R > n$ cannot have an orthogonal decomposition.
While this is one source of intractability for general tensor decompositions [HK12], we crucially
use such tensors of rank $R > n$ to give polynomial identifiability beyond the non-degenerate range
($R \le n$).

For various latent variable models, in the non-degenerate setting (where the number of mixtures/
topics $R$ is at most the dimension of the space $n$), Anandkumar et al. [AGH+12] use order 3
tensors given by the third moment tensor to identify the hidden parameters. In these tensors,
each rank-1 component corresponds to a hidden parameter, like one of the means. While these
parameters may not be orthogonal, a certain "whitening" transform of the space [AHK12, HK12]
produces a new instance in which these means are now orthogonal. For this they crucially rely on
two assumptions:

• The $n \times R$ matrix of the means has rank $\ge R$ (and is well conditioned). This of course needs
$R \le n$.

• The algorithm has access to the second moment tensor.² This assumption will not hold in
the case of the general problem of tensor decompositions.

Finally, in the context of learning latent variable models, we go beyond the non-degeneracy
barrier and get polynomial identifiability even when $n = \delta R < R$. One interesting aspect of our
results is that we use successively higher $O(1)$-moments to handle larger values of $R$ (hidden topics/
mixtures). This smooth tradeoff³ is in contrast to the works of [AHK12, HK12, AGH+12], where
they seem to get no additional advantage out of higher moments (larger than 3). Further, even
when using third moments, [AHK12, HK12, AGH+12] only obtain polynomial identifiability when
$R \le n$, whereas we obtain polynomial identifiability till $R = 3n/2 - 1$. On the other hand, since we
argue about identifiability directly through uniqueness theorems for tensors, it allows us to handle
larger values of $R$.

We also mention work on PAC learning of mixtures of $k$ product distributions (see e.g. [FOS05,
FSO06]) that typically run in $\exp(k)\,\mathrm{poly}(n)$ time and produce a distribution that is statistically
close to the underlying distribution - however they do not recover the actual mixture components 
themselves. 



2 Some preliminaries and our results 

We start with basic notation on tensors which we will use throughout the paper. We then state 
our results formally in these terms, and place them in context. In the process, we will see some 
intriguing properties of tensors (relevant to our results) which distinguish them from matrices. 

2.1 Notation and Preliminaries 

Tensors are higher dimensional arrays. An $\ell$th order, or $\ell$-dimensional tensor is an element in
$\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_\ell}$, for positive integers $n_i$. Tensors have classically been defined over complex numbers
for certain applications, but we will consider only real tensors. 

A concept that plays a crucial role for us is that of the rank of a tensor. For this, we first define 
a rank-1 tensor as a product $a^{(1)} \otimes a^{(2)} \otimes \cdots \otimes a^{(\ell)}$, where $a^{(i)}$ is an $n_i$ dimensional vector. We can
now define the rank. 

Definition 2.1 (Tensor rank, Rank $R$ decomposition). The rank of a tensor $T \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_\ell}$ is
defined to be the smallest $R$ for which there exist $R$ rank-1 tensors $T^{(i)}$ whose sum is $T$.

² This is certainly a valid assumption when learning latent variable models.

³ Note that the $R$th moment is typically sufficient to identify the parameters [BS10, MV10, FSO06].



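A rank-$R$ decomposition of $T$ is given by a set of matrices $U^{(1)}, U^{(2)}, \ldots, U^{(\ell)}$ with $U^{(i)}$ of
dimension $n_i \times R$, such that we can write $T = [U^{(1)}\ U^{(2)}\ \ldots\ U^{(\ell)}]$, which is defined by

$$[U^{(1)}\ U^{(2)}\ \ldots\ U^{(\ell)}] = \sum_{r=1}^{R} U^{(1)}_r \otimes U^{(2)}_r \otimes \cdots \otimes U^{(\ell)}_r ,$$

where we use the notation $A_r$ to denote the $r$th column vector of matrix $A$.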
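
For 3-tensors the bracket notation can be sketched in a few lines. The helper `cp_tensor` and the example factors are illustrative. The second half checks the trivial ambiguity of such decompositions: permuting the columns consistently and rescaling them by factors whose product is 1 gives the same tensor.

```python
import numpy as np


def cp_tensor(A, B, C):
    """[A B C]: T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)


A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])            # n_A = 2, R = 3
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])            # n_B = 3
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])            # n_C = 2
T = cp_tensor(A, B, C)

# Same permutation for all three factors, and column scalings with product 1.
perm = [2, 0, 1]
lam_A = np.array([2.0, -1.0, 0.5])
lam_B = np.array([0.5, 4.0, 1.0])
lam_C = 1.0 / (lam_A * lam_B)
T_same = cp_tensor(A[:, perm] * lam_A, B[:, perm] * lam_B, C[:, perm] * lam_C)
```

Uniqueness statements for tensor decompositions are therefore always meant up to this permutation and scaling.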

Third order tensors (or 3-tensors) play a central role in understanding properties of tensors in 
general (as in many other areas of mathematics, the jump in complexity occurs most dramatically 
when we go from two to three dimensions, in this case from matrices to 3-tensors). For 3-tensors, 
we will often write the decomposition as $[A\ B\ C]$, where $A, B, C$ have dimensions $n_A, n_B, n_C$
respectively. 

Definition 2.2 ($\epsilon$-close). Two tensors, represented by $T_1 = [U^{(1)}\ U^{(2)}\ \ldots\ U^{(\ell)}]$ and
$T_2 = [V^{(1)}\ V^{(2)}\ \ldots\ V^{(\ell)}]$ (of potentially different rank) are said to be $\epsilon$-close if the Frobenius norm
of the difference is small, i.e.,

$$\left\| [U^{(1)}\ U^{(2)}\ \ldots\ U^{(\ell)}] - [V^{(1)}\ V^{(2)}\ \ldots\ V^{(\ell)}] \right\|_F \le \epsilon .$$

We will sometimes write this as $T_1 =_{\epsilon} T_2$.

Unless mentioned specifically, the errors in the paper will be $\ell_2$ (or Frobenius norm, which is
the square root of the sum of squares of entries in a matrix/tensor) , since they add up conveniently. 

Definition 2.3 ($\rho$-boundedness). An $n \times R$ matrix $A$ is said to be $\rho$-bounded if each of the columns
has length at most $\rho$, for some parameter $\rho$.

A tensor represented as above, $[U^{(1)}\ U^{(2)}\ \ldots\ U^{(\ell)}]$, is $(\rho_1, \rho_2, \ldots, \rho_\ell)$-bounded if the matrix $U^{(i)}$
is $\rho_i$-bounded for all $i$.

We next define the notion of Kruskal rank, and its robust counterpart. 

Definition 2.4 (Kruskal rank, K-rank$_\tau(\cdot)$). Let $A$ be an $n \times R$ matrix. The K-rank (or Kruskal
rank) of $A$ is the largest $k$ for which every set of $k$ columns of $A$ are linearly independent.

Let $\tau$ be a parameter. The $\tau$-robust K-rank is denoted by K-rank$_\tau(A)$, and is the largest $k$ for
which every $n \times k$ sub-matrix $A_{|S}$ of $A$ has $\sigma_k(A_{|S}) \ge 1/\tau$.
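
A brute-force sketch of this definition is given below (the function name and the example matrix are hypothetical, and the enumeration over column subsets is exponential in $R$, so it is only suitable for small examples). It illustrates how a matrix can have exact K-rank 2 while its robust K-rank drops to 1 for moderate $\tau$, because two columns are nearly parallel.

```python
import numpy as np
from itertools import combinations


def robust_kruskal_rank(A, tau):
    """Largest k such that every n x k column submatrix of A has sigma_k >= 1/tau."""
    n, R = A.shape
    k = 0
    for size in range(1, min(n, R) + 1):
        ok = all(
            # Singular values are returned in decreasing order, so [-1] is sigma_size.
            np.linalg.svd(A[:, list(S)], compute_uv=False)[-1] >= 1.0 / tau
            for S in combinations(range(R), size)
        )
        if not ok:
            break
        k = size
    return k


# Columns 0 and 2 are almost parallel: linearly independent, but badly conditioned.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.01]])

k_tau_10 = robust_kruskal_rank(A, tau=10.0)
k_tau_1000 = robust_kruskal_rank(A, tau=1000.0)
```

Stopping at the first failing size is justified because removing a column from a submatrix cannot decrease its smallest singular value, so the property is monotone in $k$.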

Note that we only have a lower bound on the ($k$th) smallest singular value of $A$, and not for
example the condition number $\sigma_{\max}/\sigma_k$. This is because we will usually deal with matrices that
are also $\rho$-bounded, so such a bound will automatically hold, but our definition makes the notation
a little cleaner. We also note that this is somewhat in the spirit of (but much weaker than) the
Restricted Isometry Property (RIP) [CT05] from the Compressed Sensing literature.

Another simple linear algebra definition we use is the following 

Definition 2.5 ($\epsilon$-close to a space). Let $V$ be a subspace of $\mathbb{R}^n$, and let $\Pi$ be the projection matrix
onto $V$. Let $u \in \mathbb{R}^n$. We say that $u$ is $\epsilon$-close to $V$ if $\|u - \Pi u\| \le \epsilon$.
Other notation. For $z \in \mathbb{R}^d$, diag($z$) is the $d \times d$ diagonal matrix with the entries of $z$ occupying
the diagonal. For a vector $z \in \mathbb{R}^d$, nz($z$) denotes the number of non-zero entries in $z$. Further,
nz$_\epsilon(z)$ denotes the number of entries of magnitude $\ge \epsilon$. As is standard, we denote by $\sigma_i(A)$ the
$i$th largest singular value of a matrix $A$. Also, we abuse the notation of $\otimes$ at times, with $u \otimes v$
sometimes referring to a matrix of dimension $\dim(u) \times \dim(v)$, and sometimes a $\dim(u) \cdot \dim(v)$
vector. This will always be clear from context.

Normalization. To avoid complications due to scaling, we will assume that our tensors are
scaled such that all the $\tau_A, \tau_B, \ldots$ are $\ge 1$ and $\le \mathrm{poly}(n)$. Similarly, our upper bounds on lengths
$\rho_A, \rho_B, \ldots$ are all assumed to be between 1 and some $\mathrm{poly}(n)$. This helps simplify the statements
of our lemmas.

Error polynomials. We will, in many places, encounter statements such as "if $Q_1 \le \varepsilon$, then $Q_2 \le
(3n^2\gamma) \cdot \varepsilon$", with polynomials $\vartheta$ (in this case $3n^2\gamma$) involving the variables $n, R, k_A, k_B, k_C, \tau, \rho, \dots$.
In order to keep track of these, we use the notation $\vartheta_1, \vartheta_2, \dots$. Sometimes, to refer to a polynomial
introduced in Lemma 3.11, for instance, we use $\vartheta_{3.11}$. Unless specifically mentioned, they will be
polynomials in the parameters mentioned above, so we do not mention them each time.

2.2 Our Results 

We are now ready to formally state the results in our work. The first is a robust version of the 
uniqueness of decomposition for 3-tensors. 

Theorem 2.6 (Unique Decompositions). Suppose a rank-$R$ tensor $T = [A\ B\ C]$ is $(\rho_A, \rho_B, \rho_C)$-
bounded, with $\text{K-rank}_{\tau_A}(A) = k_A$, $\text{K-rank}_{\tau_B}(B) = k_B$, $\text{K-rank}_{\tau_C}(C) = k_C$ satisfying $k_A + k_B + k_C \ge
2R + 2$. Then for every $0 < \varepsilon' < 1$, there exists

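\[ \varepsilon = \varepsilon' \Big/ \Big( R^{6}\, \vartheta_{2.6}(\tau_A, \rho_A, \rho'_A, n_A)\, \vartheta_{2.6}(\tau_B, \rho_B, \rho'_B, n_B)\, \vartheta_{2.6}(\tau_C, \rho_C, \rho'_C, n_C) \Big), \]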

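for some polynomial $\vartheta_{2.6}$ such that for any other $(\rho'_A, \rho'_B, \rho'_C)$-bounded decomposition $[A'\ B'\ C']$ of
rank $R$ that is $\varepsilon$-close to $[A\ B\ C]$, there exists an $(R \times R)$ permutation matrix $\Pi$ and diagonal
matrices $\Lambda_A, \Lambda_B, \Lambda_C$ such that

\[ \|\Lambda_A \Lambda_B \Lambda_C - I\|_F \le \varepsilon' \quad\text{and}\quad \|A' - A \Pi \Lambda_A\|_F \le \varepsilon' \ \text{(similarly for $B$ and $C$).} \tag{4} \]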

We remark that in order to prove the theorem, we did not make any assumptions about the
Kruskal ranks of $A', B', C'$. We simply assumed that they are bounded. This is an interesting
feature of our proof, and is formalized in Lemma 3.4. Another observation: though we assumed
that the decomposition $[A'\ B'\ C']$ is rank $R$, we really need only an upper bound. This is because
we can append zeroes and apply the theorem.
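
To see why the conclusion (4) is stated only up to a permutation $\Pi$ and scalings $\Lambda_A, \Lambda_B, \Lambda_C$, note that these operations leave the tensor unchanged whenever the scalings multiply to the identity. The sketch below checks this numerically on random factors; the helper `tensor_from_factors` and all sizes are illustrative assumptions, not the paper's notation.

```python
import numpy as np


def tensor_from_factors(A, B, C):
    """Build T = sum_r A_r (x) B_r (x) C_r from factor matrices with R columns each."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)


rng = np.random.default_rng(0)
R = 3
A = rng.standard_normal((4, R))
B = rng.standard_normal((5, R))
C = rng.standard_normal((6, R))

perm = np.array([2, 0, 1])               # a common column permutation Pi
lam_A = np.array([2.0, -1.0, 0.5])       # diagonal of Lambda_A
lam_B = np.array([0.25, 4.0, -1.0])      # diagonal of Lambda_B
lam_C = 1.0 / (lam_A * lam_B)            # forces Lambda_A Lambda_B Lambda_C = I

# A Pi Lambda_A: permute the columns, then rescale column r by lam_A[r].
A2 = A[:, perm] * lam_A
B2 = B[:, perm] * lam_B
C2 = C[:, perm] * lam_C

gap = np.linalg.norm(tensor_from_factors(A, B, C) - tensor_from_factors(A2, B2, C2))
```

Here `gap` is zero up to floating-point error, so the two decompositions describe the same tensor; uniqueness can therefore only ever hold modulo this ambiguity.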

Our next result is a higher dimensional analogue of the above. 

Theorem 2.7 (Uniqueness of Decompositions for Higher Orders). Suppose we are given an order $\ell$
tensor (with $\ell \le R$), $T = [U^{(1)}\ U^{(2)} \dots U^{(\ell)}]$, where $\forall j \in [\ell]$ the $n_j$-by-$R$ matrix $U^{(j)}$ is $\rho_j$-bounded,
with $\text{K-rank}_{\tau_j}(U^{(j)}) = k_j \ge 2$ satisfying

\[ \sum_{j=1}^{\ell} k_j \ge 2R + \ell - 1. \]
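
Then for every $0 < \varepsilon' < 1$, there exists $\varepsilon = \vartheta^{(\ell)}_{2.7}\!\left(\frac{\varepsilon'}{R}\right) \Big( \prod_{j \in [\ell]} \vartheta_{2.7}(\tau_j, \rho_j, \rho'_j, n_j) \Big)^{-1}$ such that,
for any other $(\rho'_1, \rho'_2, \dots, \rho'_\ell)$-bounded decomposition $[V^{(1)}\ V^{(2)} \dots V^{(\ell)}]$ which is $\varepsilon$-close to $T$, there
exists an $R \times R$ permutation matrix $\Pi$ and diagonal matrices $\{\Lambda^{(j)}\}_{j \in [\ell]}$ such that

\[ \Big\| \prod_{j \in [\ell]} \Lambda^{(j)} - I \Big\|_F \le \varepsilon' \quad\text{and}\quad \forall j \in [\ell],\ \big\| V^{(j)} - U^{(j)} \Pi \Lambda^{(j)} \big\|_F \le \varepsilon'. \tag{5} \]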



Setting $\vartheta^{(\ell)}_{2.7}(x) = x^{2^{\ell}}$ and $\vartheta_{2.7}(\tau_j, \rho_j, \rho'_j, n_j) = (\tau_j \rho_j \rho'_j n_j)^{O(1)}$ suffice for the theorem.



Since finding a small rank decomposition of a tensor is of great practical interest as we have 
seen, it is natural to ask if it is possible to compute it efficiently. We can prove: 

Theorem 2.8. Suppose $T$ is a 3-tensor which has an (unknown) $\rho$-bounded representation $[A\ B\ C]$,
where $A, B, C$ have dimensions $n_A \times R$, $n_B \times R$ and $n_C \times R$ respectively, for some parameter $\rho$.
Then, given a tensor T' which is e-close to T, we can find a rank-R tensor T" (along with its 
decomposition) which is Be close to T in time poly{nA,nB,nc) ■ exp(i?^log(-R/9/e)). 

We can view the above as an approximation algorithm for the low-rank approximation problem
for tensors. We will expound on this viewpoint in Section 4. We also note that although our
algorithm is quite simple, it has a running time better than simply trying to guess the $3R$ vectors
in the decomposition. The latter typically takes time $\exp(R(n_A + n_B + n_C))$, which could be much
worse than our bound for small values of $R$ (which is when the low rank approximation problem is
typically interesting).

As we mentioned before, the algorithm does not need the promised decomposition $[A\ B\ C]$ to
have large K-rank. However, if we are guaranteed that it has additional well-conditioned properties
(e.g., the sum of the K-ranks of $A, B, C$ is $\ge 2R + 2$), then Theorem 2.7 guarantees that the algorithm
finds this particular decomposition (up to a small error).

Also, the algorithm extends naturally to higher dimensional tensors: we state this version in
Section 4 (Theorem 4.5).



Finally, we show how the above results on tensor decompositions can be used to learn latent 
variable models with polynomial samples, hence showing polynomial identifiability under some weak 
conditions involving the K-rank of the matrices. We first show polynomial identifiability for the 
Multi-view mixture model, which captures various latent variable models that are used commonly. 

Theorem 2.9 (Polynomial Identifiability of Multi-view mixture model). The following statement
holds for any constant integer $\ell$. Suppose we are given samples from a multi-view mixture model
(see Def 5.2), with the parameters satisfying:

(a) For each mixture $r \in [R]$, the mixture weight $w_r > \gamma$.

(b) For each $j \in [\ell]$, $\text{K-rank}_\tau(M^{(j)}) \ge k \ge \frac{2R}{\ell} + 1$.

then there is an algorithm that given any $\eta > 0$ uses $N = \vartheta^{(\ell)}_{2.9}\big(\tfrac{1}{\gamma}, R, n, \tau, \tfrac{1}{\eta}, c_{\max}\big)$ samples,
and finds with high probability $\{\tilde{M}^{(j)}\}_{j \in [\ell]}$ and $\{\tilde{w}_r\}_{r \in [R]}$ (up to renaming of the mixtures $\{1, 2, \dots, R\}$)
such that

\[ \forall j \in [\ell],\ \big\| M^{(j)} - \tilde{M}^{(j)} \big\|_F \le \eta \quad\text{and}\quad \forall r \in [R],\ |w_r - \tilde{w}_r| < \eta. \tag{6} \]
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Further, this algorithm runs in time exp I R 



?2/)2 



22^1og(f ) + ilog{n ■ ^^)]) poly{n) 



time. 



Polynomial identifiability of the Multi-view mixture model also leads to polynomial identifiability
of other latent variable models like topic models and HMMs. The following corollary shows
that Hidden Markov models can be learned from polynomially many samples by observing a constant
number of consecutive time steps under mild conditions involving the K-rank (the constant depends
on the exact K-rank condition). Please refer to Section 5 to see the implications for other latent
variable and mixture models like topic models, mixtures of Gaussians etc.



Corollary 5.5 (Polynomial Identifiability of Hidden Markov models). The following statement
holds for any constant $\delta > 0$. Suppose we are given a Hidden Markov model with parameters as
follows:

(a) The stationary distribution $\{w_r\}_{r \in [R]}$ has $\forall r \in [R]$, $w_r > \gamma_1$,

(b) The observation matrix $M$ has $\text{K-rank}_\tau(M) \ge k \ge \delta R$,

(c) The transition matrix $P$ has minimum singular value $\sigma_R(P) \ge \gamma_2$,

then there is a algorithm that given any r] > uses N = "i^gTgf^^ ( -1 -Rj ^) ''"1 ) samples 

of m = 2[^] +3 consecutive observations (of the Markov Chain), and finds with high probability, 
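$P'$, $M'$ and $\{\tilde{w}_r\}_{r \in [R]}$ such that

\[ \|M - M'\|_F \le \eta, \quad \|P - P'\|_F \le \eta \quad\text{and}\quad \forall r \in [R],\ |w_r - \tilde{w}_r| < \eta. \tag{7} \]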

burther, this algorithm runs m time n ''^'i I n • — — 1 time. 

Note that the above results shows polynomial identifiability (for constant 5 > 0), and addition- 
ally gives an algorithm which takes time n '^' )poly(n, r, i?) for inverse polynomial error. To the 
best of our knowledge such algorithmic results with only a polynomial dependence on n were not 
known for learning HMMs and topic models. 

2.3 Auxiliary lemmas 

In our proofs we will require several simple (mostly elementary linear algebra) lemmas.
Section A is a medley of such lemmas. Most of the proofs are reasonably straightforward, and thus we
place them in the Appendix.

3 Uniqueness of Tensor Decompositions 



First we consider third order tensors and prove Theorem 2.6 (Sections 3.1 and 3.2). Our proof
broadly follows along the lines of Kruskal's original proof of the uniqueness theorem [Kru77]. The
key ingredient, which is a robust version of the so-called permutation lemma, is presented in
Section 3.3, since it seems interesting in its own right. Finally we will see how to reduce the case of higher
order tensors, i.e. Theorem 2.7, to that of third order tensors (Section 3.4).

3.1 Uniqueness Theorem for Third Order Tensors



The proof of Theorem 2.6 broadly has two parts. First, we prove that if $[A\ B\ C] = [A'\ B'\ C']$,
then $A$ is a permutation of $A'$, $B$ of $B'$, and $C$ of $C'$. Second, we prove that the permutations in
the (three) different "modes" (or dimensions) are indeed equal. Let us begin by describing a lemma
which is key to the first step.

The Permutation Lemma. This is the core of Kruskal's argument for the uniqueness of tensor
decompositions. Given two matrices $X$ and $Y$, how does one conclude that the columns are
permutations of each other? Kruskal gives a very clever sufficient condition, involving looking at test
vectors $w$, and considering the number of non-zero entries of $w^T X$ and $w^T Y$. The intuition is that
if $X$ and $Y$ are indeed permutations, these numbers are precisely equal for all $w$.

Kruskal proves that if this sufficient condition holds, then $X$ and $Y$ must have columns which
are permutations of each other, up to scaling. More precisely, suppose $X, Y$ are $n \times R$ matrices of
rank $k$. Let $\mathrm{nz}(x)$ denote the number of non-zero entries in a vector $x$. The lemma then states
that if for all $w$, we have

\[ \mathrm{nz}(w^T X) \le R - k + 1 \implies \mathrm{nz}(w^T Y) \le \mathrm{nz}(w^T X), \]

then the matrices $X$ and $Y$ have columns which are permutations of each other up to a scaling.
That is, there exists an $R \times R$ permutation matrix $\Pi$, and a diagonal matrix $\Lambda$ s.t. $Y = X \Pi \Lambda$.
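
The easy direction of this statement can be checked numerically: if $Y = X \Pi \Lambda$ with non-zero scalings, then $w^T Y$ is a permuted and rescaled copy of $w^T X$, so the two non-zero counts agree for every test vector $w$. The sketch below picks a $w$ orthogonal to three columns of a random $X$; the tolerance `tol`, the sizes, and the helper `nz` are assumptions made for the illustration.

```python
import numpy as np


def nz(z, tol=1e-9):
    """Number of entries of z with magnitude above tol (numerical stand-in for 'non-zero')."""
    return int(np.sum(np.abs(z) > tol))


rng = np.random.default_rng(1)
n, R = 4, 5
X = rng.standard_normal((n, R))

perm = rng.permutation(R)
scales = rng.uniform(0.5, 2.0, size=R)   # non-zero diagonal of Lambda
Y = X[:, perm] * scales                  # Y = X Pi Lambda

# Test vector orthogonal to the first three columns of X: the last left
# singular vector of the 4 x 3 submatrix spans its orthogonal complement.
U, _, _ = np.linalg.svd(X[:, :3], full_matrices=True)
w = U[:, -1]

nz_X = nz(w @ X)   # at most R - 3 entries survive
nz_Y = nz(w @ Y)   # same count, since Y's columns are rescaled columns of X
```

Kruskal's lemma is the much harder converse: agreement of these counts on all sparse-producing test vectors forces $Y = X \Pi \Lambda$.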

We prove a robust version of this lemma, stated as follows (recall the definition of $\mathrm{nz}_\varepsilon(\cdot)$,
Section 2):

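Lemma 3.1 (Robust permutation lemma). Suppose $X, Y$ are $\rho$-bounded $n \times R$ matrices such that
$\text{K-rank}_\tau(X)$ and $\text{K-rank}_\tau(Y)$ are $\ge k$, for some integer $k \ge 2$. Further, suppose that for $\varepsilon < 1/\vartheta_{3.1}$,
the matrices satisfy:

\[ \forall w \text{ s.t. } \mathrm{nz}(w^T X) \le R - k + 1, \text{ we have } \mathrm{nz}_\varepsilon(w^T Y) \le \mathrm{nz}(w^T X), \tag{8} \]

then there exists an $R \times R$ permutation matrix $\Pi$, and a diagonal matrix $\Lambda$ s.t. $X$ and $Y$ satisfy
$\|X - Y \Pi \Lambda\|_F < \vartheta_{3.1} \cdot \varepsilon$. In fact, we can pick $\vartheta_{3.1} := (nR^2) \cdot \vartheta_{3.5}$.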

Outline of the section. In the remainder of this section, we will prove that $A'$ is a permutation
of $A$, $B'$ of $B$ and $C'$ of $C$. We do this by assuming Lemma 3.1 for now (it will be proved in
Section 3.3) and proving that if $[A\ B\ C] =_\varepsilon [A'\ B'\ C']$, then the conditions of the lemma hold for
$C', C$ as $X, Y$ in the statement respectively. We can repeat this argument with $A$, $B$ to obtain the
conclusion.



We now state the key technical lemma which allows us to verify that the hypotheses of Lemma 3.1
hold. It says that for any $k_C - 1$ columns of $C'$ there are at least as many columns of $C$ which are close
to the span of the chosen columns from $C'$.



Lemma 3.2. Suppose $A, B, C, A', B', C'$ satisfy the conditions of Theorem 2.6, and suppose $[A\ B\ C] =_\varepsilon
[A'\ B'\ C']$. Then for any unit vector $x$, we have

\[ \forall \varepsilon' \ge 0, \quad \mathrm{nz}_{\varepsilon'}(x^T C') \le R - k_C + 1 \implies \mathrm{nz}_{\varepsilon''}(x^T C) \le \mathrm{nz}_{\varepsilon'}(x^T C') \]

for $\varepsilon'' = \vartheta_{3.2} \cdot (\varepsilon + \varepsilon')$, where $\vartheta_{3.2} := 4R^3 (\tau_A \tau_B \tau_C)^2 \rho_A \rho_B \rho_C (\rho'_A \rho'_B \rho'_C)^2$.

Remark. This lemma, together with its corollary Lemma 3.4, will imply the conditions of the
permutation lemma. Lemma 3.4 lets us conclude that $\text{K-rank}_{\vartheta\tau}(C') \ge \text{K-rank}_\tau(C)$ for some error
polynomial $\vartheta$, which is essential in our proof of the permutation lemma. It also has other
implications, as we will see. While the proof of the robust permutation lemma (Lemma 3.1) will directly
apply this Lemma with $\varepsilon' = 0$, we will need the $\varepsilon' > 0$ case for establishing Lemma 3.4.

A key component of the proof is to view the three-dimensional tensor $[A\ B\ C]$ as a bunch of
matrix slices, and argue about the rank (or conditioned-ness) of weighted combinations of these
slices. One observation, which follows from the Cauchy-Schwarz inequality, is the following: if
$[A\ B\ C] =_\varepsilon [A'\ B'\ C']$, then by taking a combination of slices along the third dimension (with
weights given by $x \in \mathbb{R}^{n_C}$, i.e., reweighing the $i$th slice by $x_i$ and summing these matrices) we have

\[ \forall x \in \mathbb{R}^{n_C}: \quad \big\| A\, \mathrm{diag}(x^T C)\, B^T - A'\, \mathrm{diag}(x^T C')\, (B')^T \big\|_F^2 \le \varepsilon^2 \|x\|_2^2. \tag{9} \]
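
The two facts behind (9) can be verified numerically: a weighted sum of the slices of $[A\ B\ C]$ along the third mode equals $A\,\mathrm{diag}(x^T C)\,B^T$, and the same weighted sum applied to an error tensor $E$ has Frobenius norm at most $\|E\|_F \|x\|_2$ by Cauchy-Schwarz. This is a small sketch with arbitrary sizes and a random stand-in `E` for the difference of two close tensors; none of the variable names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
nA, nB, nC, R = 4, 5, 6, 3
A = rng.standard_normal((nA, R))
B = rng.standard_normal((nB, R))
C = rng.standard_normal((nC, R))
T = np.einsum("ir,jr,kr->ijk", A, B, C)

x = rng.standard_normal(nC)

# Weighted combination of the slices T[:, :, k] with weights x_k ...
slice_combo = np.einsum("ijk,k->ij", T, x)
# ... equals A diag(x^T C) B^T, where x @ C is the length-R vector x^T C.
factored = A @ np.diag(x @ C) @ B.T
identity_gap = np.linalg.norm(slice_combo - factored)

# Cauchy-Schwarz: the same slice combination of an error tensor E is
# bounded by ||E||_F * ||x||_2 (np.linalg.norm flattens the 3-d array).
E = 1e-3 * rng.standard_normal(T.shape)
lhs = np.linalg.norm(np.einsum("ijk,k->ij", E, x))
rhs = np.linalg.norm(E) * np.linalg.norm(x)
```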

We now begin the proof of the Lemma. 



Proof of Lemma 3.2. W.l.o.g., we may assume that $k_A \ge k_B$ (the proof for $k_A < k_B$ will follow
along the same lines). For convenience, let us define $\alpha$ to be the vector $x^T C$, and $\beta$ the vector
$x^T C'$. Let $t$ be the number of entries of $\beta$ of magnitude $> \varepsilon'$. The assumption of the lemma implies
that $t \le R - k_C + 1$. Now from (9), we have

\[ M := \sum_i \alpha_i A_i \otimes B_i = \sum_i \beta_i A'_i \otimes B'_i + Z, \tag{10} \]

where $Z$ is an error matrix satisfying $\|Z\|_F \le \varepsilon$. Now, since the RHS has at most $t$ terms with
$|\beta_i| > \varepsilon'$, we have that $\sigma_{t+1}$ of the LHS is at most $R \rho'_A \rho'_B \varepsilon' + \varepsilon$. Using the value of $t$, we obtain

\[ \sigma_{R - k_C + 2}(M) \le \sigma_{t+1}(M) \le \varepsilon + (R \rho'_A \rho'_B)\, \varepsilon'. \tag{11} \]

We will now show that if $x^T C$ has too many co-ordinates which are larger than $\varepsilon''$ then we
will contradict (11). One tricky case we need to handle is the following: while each of these
non-negligible co-ordinates of $x^T C$ will give rise to a large rank-1 term, they can be canceled out by
combinations of the rank-1 terms corresponding to entries of $x^T C$ which are slightly smaller than
$\varepsilon''$. Hence, we will also set a smaller threshold $\delta$ and first handle the case when there are many
co-ordinates in $x^T C$ which are larger than $\delta$. $\delta$ is chosen so that the terms with $(x^T C)_i < \delta$ cannot
cancel out any of the large terms ($(x^T C)_i \ge \varepsilon''$).

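Define $S_1 = \{i : |(x^T C)_i| > \varepsilon''\}$ and $S_2 = \{i : |(x^T C)_i| > \delta\}$, where $\delta = \varepsilon''/\vartheta$ for some error
polynomial $\vartheta = 2R^2 \rho_A \rho_B \rho_C \rho'_A \rho'_B \rho'_C \tau_A \tau_B \tau_C$ (which is always $> 1$). Thus we have $S_1 \subseteq S_2$. We
consider two cases.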

Case 1: $|S_2| \ge k_B$.

In this case we will give a lower bound on $\sigma_{R - k_C + 2}(M)$, which gives a contradiction to (11). The
intuition is roughly that $A$, $B$ have $k_A$, $k_B$ large singular values, and thus the product should have
enough large ones as well. To formalize this, we use the following well-known fact about singular
values of products, which is proved by considering the variational characterization of singular values:

Fact 3.3. Let $P, Q$ be matrices of dimensions $p \times m$ and $m \times q$ respectively. Then for all $\ell, i$ such
that $\ell \le \min\{p, q\}$, we have

\[ \sigma_\ell(PQ) \ge \sigma_{\ell + m - i}(P)\, \sigma_i(Q). \tag{12} \]
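
As a sanity check rather than a proof, the inequality (12) can be tested on random matrices. The sketch below uses the convention, assumed here for the illustration, that a singular value whose index exceeds the smaller dimension of the matrix is zero, in which case the inequality holds trivially. The helper names `sigma` and `fact_3_3_holds` are hypothetical.

```python
import numpy as np


def sigma(M, idx):
    """idx-th largest singular value of M (1-indexed); 0 if idx is out of range."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[idx - 1]) if 1 <= idx <= len(s) else 0.0


def fact_3_3_holds(P, Q, tol=1e-9):
    """Check sigma_l(PQ) >= sigma_{l+m-i}(P) * sigma_i(Q) for all admissible l, i."""
    p, m = P.shape
    m2, q = Q.shape
    assert m == m2, "inner dimensions must agree"
    PQ = P @ Q
    for ell in range(1, min(p, q) + 1):
        for i in range(1, min(m, q) + 1):
            lower = sigma(P, ell + m - i) * sigma(Q, i)
            if sigma(PQ, ell) < lower - tol:
                return False
    return True


rng = np.random.default_rng(3)
shapes = [(4, 3, 5), (5, 5, 5), (3, 6, 4), (6, 2, 6)]
results = [
    fact_3_3_holds(rng.standard_normal((p, m)), rng.standard_normal((m, q)))
    for (p, m, q) in shapes
]
```

In the proof, the fact is applied with $P = A$, $Q = \mathrm{diag}(\alpha) B^T$, $m = R$, $\ell = R - k_C + 2$ and $i = k_B$, which gives the index $2R + 2 - k_B - k_C$ on $P$.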

Now, let us view $M$ as $PQ$, where $P = A$, and $Q = \mathrm{diag}(\alpha) B^T$. We will show that $\sigma_{k_B}(Q) \ge
\delta/\tau_B$, and that $\sigma_{2R + 2 - k_B - k_C}(P) \ge 1/\tau_A$. These will then imply a contradiction to (11) by setting
$\ell = R - k_C + 2$ and $i = k_B$ since

\[ \sigma_{R - k_C + 2}(M) \ge \frac{\delta}{\tau_A \tau_B} = \frac{\varepsilon''}{\vartheta\, \tau_A \tau_B} > (R \rho'_A \rho'_B \varepsilon' + \varepsilon) \quad \text{by our choice of } \vartheta_{3.2}. \]

(It is easy to check that $\ell \le \min\{k_A, k_B\} \le \min\{n_A, n_B\}$, and thus we can use the fact above.)

Thus we only need to show the two inequalities above. The latter is easy, because by the
hypothesis we have $2R + 2 - k_B - k_C \le k_A$, and we know that $\sigma_{k_A}(A) \ge 1/\tau_A$, by the definition
of $\text{K-rank}_{\tau_A}(A)$. Thus it remains to prove the first inequality. To see this, let $J \subseteq S_2$ be of size
$k_B$. Let $B_J^T$ and $Q_J$ be the submatrices of $B^T$ and $Q$ restricted to rows of $J$. Thus we have
$Q_J = \mathrm{diag}(\alpha)_J B_J^T$. Because of the Kruskal condition, every $k_B$-sized submatrix of $B$ is
well-conditioned, and thus $\sigma_{k_B}(B_J^T) = \sigma_{k_B}(B_J) \ge 1/\tau_B$.

Further, since $|\alpha_j| > \delta$ $\forall j \in J$, multiplication by the diagonal cannot lower the singular values by
much, and we get $\sigma_{k_B}(Q_J) \ge \delta/\tau_B$. This can also be seen formally by noting that $\sigma_{k_B}(\mathrm{diag}(\alpha)_J) \ge \delta$,
and applying Fact 3.3 with $P = \mathrm{diag}(\alpha)_J$, $Q = B_J^T$ and $\ell = m = i = k_B$.

Finally, since $Q$ is essentially $Q_J$ along with additional rows, we have $\sigma_{k_B}(Q) \ge \sigma_{k_B}(Q_J) \ge \delta/\tau_B$.
From the argument earlier, we obtain a contradiction in this case.

Case 2: $|S_2| < k_B$.

Roughly, by defining $S_1, S_2$, we have divided the coefficients $\alpha_i$ into large ($\ge \varepsilon''$), small, and
tiny ($< \delta$). In this case, we have that the number of large and small terms together (in $M$, see
Eq. (10)) is at most $k_B$. For contradiction, we can assume the number of large ones is $\ge t + 1$,
since we are done otherwise. The aim is to now prove that this implies a lower bound on $\sigma_{t+1}(M)$,
which gives a contradiction to Eq. (11).

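Now let us define $M' = \sum_{i \in S_2} \alpha_i (A_i \otimes B_i)$. Thus $M$ and $M'$ are equal up to tiny terms. Further,
let $\Pi$ be the matrix which projects a vector onto the span of $\{B'_i : |\beta_i| > \varepsilon'\}$, i.e., the span of the
columns of $B'$ which correspond to $|\beta_i| > \varepsilon'$. Because there are at most $t$ such $\beta_i$, this is a space of
dimension $\le t$. Thus we can rewrite Eq. (10) as

\[ M' = \sum_{i \in S_1} \alpha_i (A_i \otimes B_i) + \sum_{j \in S_2 \setminus S_1} \alpha_j (A_j \otimes B_j) = \sum_{i=1}^{t} \beta_i (A'_i \otimes B'_i) + \mathrm{Err}, \tag{13} \]

where we assumed w.l.o.g. that $|\beta_i| > \varepsilon'$ for $i \in [t]$, and $\mathrm{Err}$ is an error matrix of Frobenius norm
at most $\varepsilon + R(\rho_A \rho_B \delta + \rho'_A \rho'_B \varepsilon') \le \varepsilon + (R \rho_A \rho_B \rho'_A \rho'_B)(\delta + \varepsilon')$.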

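Now because $|S_1| \ge t + 1$, and $\text{K-rank}_{\tau_B}(B) \ge k_B \ge t + 1$, there must be one vector among
the $B_i$, $i \in S_1$, which has a reasonably large projection orthogonal to the span above, i.e., which
satisfies

\[ \|B_i - \Pi B_i\|_2 \ge 1/(\tau_B \sqrt{R}). \]

Let us pick a unit vector $y$ along $B_i - \Pi B_i$. Consider the equality (13) and multiply by $y$ on both
sides. We obtain

\[ \sum_{i \in S_2} \alpha_i \langle B_i, y \rangle A_i = (\mathrm{Err})\, y. \]

Thus we have a combination of the $A_i$'s, with at least one coefficient being $\ge \varepsilon''/(R \tau_B)$, having a
magnitude at most $\|(\mathrm{Err}) y\|_2 \le \vartheta_1 (\delta + \varepsilon' + \varepsilon)$, where $\vartheta_1 := R \rho_A \rho_B \rho'_A \rho'_B$ is the factor appearing in the bound on $\mathrm{Err}$ above.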

Now $k_A \ge k_B > |S_2|$. So, we obtain a contradiction by Lemma A.1 since:

\[ \|(\mathrm{Err}) y\|_2 \le \vartheta_1 (\delta + \varepsilon' + \varepsilon) = R \rho_A \rho_B \rho'_A \rho'_B (\delta + \varepsilon' + \varepsilon) < \frac{1}{\tau_A} \cdot \frac{\varepsilon''}{R \tau_B}. \]

The last inequality follows because $\vartheta = 2R^2 \rho_A \rho_B \rho_C \rho'_A \rho'_B \rho'_C \tau_A \tau_B \tau_C$.

This completes the proof in this case, hence concluding the proof of the lemma. $\square$

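The next lemma uses the above to conclude that $\text{K-rank}_{\vartheta\tau}(C') \ge \text{K-rank}_\tau(C)$, for some
polynomial $\vartheta$.

Lemma 3.4. Let $A, B, C, A', B', C'$ be as in the setting of Theorem 2.6. Suppose $[A\ B\ C] =_\varepsilon
[A'\ B'\ C']$, with

$\varepsilon < 1/\vartheta_{3.4}$, where $\vartheta_{3.4} = R \tau_A \tau_B \tau_C\, \vartheta_{3.2} = 4R^4 (\tau_A \tau_B \tau_C)^3 \rho_A \rho_B \rho_C (\rho'_A \rho'_B \rho'_C)^2$.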
Then A',B',C' have K-rank^-/ to be at least kA,kB,kc respectively, where t' := ■ %:^ . 

Remark. The lemma implies that if $T$ has a well-conditioned decomposition which satisfies the
Kruskal conditions, then any other bounded decomposition which is a sufficiently good
approximation should also be reasonably well-conditioned. Further, it says that the decomposition $[A'\ B'\ C']$
cannot be of rank $< R$. Otherwise, we could add some zero-columns to each of $A', B', C'$ and
apply this lemma to conclude that the K-rank of $A'$ is $\ge 2$, a contradiction if there exists a zero column.

Proof. By symmetry, let us just show this for matrix $C'$ (dimensions $n \times R$), and let $k = k_C$ for
convenience. We need to show that every $n$-by-$k$ submatrix of $C'$ has minimum singular value
$\ge \delta = 1/\tau'_C$.

For contradiction let $C'_S$ be the submatrix corresponding to the columns in $S$ ($|S| = k$), such
that $\sigma_k(C'_S) < \delta$. Let us consider a left singular vector $z$ which corresponds to $\sigma_k(C'_S)$, and suppose
$z$ is normalized to be unit length. Then we have

\[ \sum_{i \in S} \langle z, C'_i \rangle^2 < \delta^2. \]

Thus $|\langle z, C'_i \rangle| < \delta$ for all $i \in S$, so at least $k$ of the $R$ entries of $z^T C'$ are below $\delta$ in magnitude, i.e. $\mathrm{nz}_\delta(z^T C') \le R - k$. Now from Lemma 3.2
we have $\mathrm{nz}_{\varepsilon_1}(z^T C) \le R - k$, where $\varepsilon_1 = \vartheta_{3.2}(\varepsilon + \delta)$.

Let $J$ denote the set of indices in $z^T C$ which are $< \varepsilon_1$ in magnitude (by the above, we have $|J| \ge
k$). Thus we have $\|z^T C_J\|_2 < R \varepsilon_1$, which leads to a contradiction if we have $\text{K-rank}_{1/(R \varepsilon_1)}(C) \ge k$.

Since this is true for our choice of parameters, the claim follows. $\square$

Once we have the lemmas above, let us check that the conditions of the robust permutation
lemma hold with $C', C$ taking the roles of $X, Y$ in Lemma 3.1, with $k = k_C$ and with $\tau$ set to the
robustness parameter supplied by Lemma 3.4.
From Lemma 3.4 it follows that $\text{K-rank}_\tau(C)$ and $\text{K-rank}_\tau(C')$ are both $\ge k$, and setting $\varepsilon' = 0$
in Lemma 3.2 the other condition of Lemma 3.1 holds. Thus we can conclude that there exists a
permutation matrix $\Pi_C$ and a diagonal matrix of scalars $\Lambda_C$ such that $\|C' - C \Pi_C \Lambda_C\|_F$ is small.
We will see the quantitative details in what follows.

3.2 Wrapping up the proof



We are now ready to complete the robust Kruskal's theorem. From what we saw above, the main 
part that remains is to prove that the permutations in the various dimensions are equal. 



Proof of Theorem 2.6. Suppose we are given an $\varepsilon' < 1$ as in the statement of the theorem. For a
moment, suppose $\varepsilon$ is small enough, and $A, B, C, A', B', C'$ satisfying the conditions of the theorem
produce tensors which are $\varepsilon$-close.

From the hypothesis, note that $k_A, k_B, k_C \ge 2$ (since $k_A, k_B, k_C \le R$, and $k_A + k_B + k_C \ge 2R + 2$).
Thus from the Lemmas 3.4 and 3.2 (setting $\varepsilon' = 0$), we obtain that $C, C'$ satisfy the hypothesis of the
Robust permutation lemma (Lemma 3.1) with $C', C$ set to $X, Y$ respectively, and the parameters

"r" := %!] ; V := ^JF- 



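Hence, we apply Lemma 3.1 to $A$, $B$ and $C$, and get that there exist permutation matrices $\Pi_A$,
$\Pi_B$ and $\Pi_C$ and scalar matrices $\Lambda_A, \Lambda_B, \Lambda_C$ such that for $\varepsilon_2 = \vartheta_{3.1} \cdot \varepsilon$,

\[ \|A' - A \Pi_A \Lambda_A\|_F < \varepsilon_2, \quad \|B' - B \Pi_B \Lambda_B\|_F < \varepsilon_2 \quad\text{and}\quad \|C' - C \Pi_C \Lambda_C\|_F < \varepsilon_2. \tag{14} \]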



We now need to prove that these three permutations are in fact identical, and that the scalings
multiply to the identity (up to small error).

To show $\Pi_A = \Pi_B = \Pi_C$:

Let us assume for contradiction that $\Pi_A \ne \Pi_B$. We will use an index where the permutations
disagree to obtain a contradiction to the assumptions on the K-rank.

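For notational convenience, let $\pi_A : [R] \to [R]$ correspond to the permutation given by $\Pi_A$,
with $\pi_A(r)$ being the column that $A'_r$ maps to. Permutation $\pi_B : [R] \to [R]$ similarly corresponds
to $\Pi_B$. Using (14) for $A$ we have

\[ \Big\| \sum_{r \in [R]} A'_r \otimes B'_r \otimes C'_r - \sum_{r \in [R]} \Lambda_A(r)\, A_{\pi_A(r)} \otimes B'_r \otimes C'_r \Big\|_F < \varepsilon_2 \sqrt{R}\, \rho'_B \rho'_C \quad \text{using Cauchy-Schwarz.} \]

By a similar argument, and using triangle inequality (along with $\varepsilon_2 \le 1 \le \rho'_B$) we get

\[ \Big\| \sum_{r \in [R]} A'_r \otimes B'_r \otimes C'_r - \sum_{r \in [R]} \Lambda_A(r) \Lambda_B(r) \cdot A_{\pi_A(r)} \otimes B_{\pi_B(r)} \otimes C'_r \Big\|_F < 2 \varepsilon_2 \sqrt{R}\, (\rho'_B \rho'_C + \rho'_A \rho'_C). \]

Let us take linear combinations given by unit vectors $v$ and $w$, of the given tensor $T = [A\ B\ C]$
along the first and second dimensions. By combining the above inequality along with the fact that
the two decompositions are $\varepsilon$-close, i.e. $\big\| \sum_{r \in [R]} A_r \otimes B_r \otimes C_r - A'_r \otimes B'_r \otimes C'_r \big\|_F \le \varepsilon$, we have

\[ \|Z - Z'\| \le \varepsilon_3 = \varepsilon + 2 \varepsilon_2 R \rho'_C (\rho'_A + \rho'_B), \quad \text{where} \]

\[ Z = \sum_{r \in [R]} \langle v, A_r \rangle \langle w, B_r \rangle\, C_r \quad\text{and}\quad Z' = \sum_{r \in [R]} \Lambda_A(r) \Lambda_B(r)\, \langle v, A_{\pi_A(r)} \rangle \langle w, B_{\pi_B(r)} \rangle\, C'_r. \]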

Note that the $\varepsilon$ term above is negligible compared to the second term involving $\varepsilon_2$.
We know that $\pi_A \ne \pi_B$, so there exist $s \ne t \in [R]$ such that $r^* = \pi_A(s) = \pi_B(t)$. We will now use
this $r^*$ to pick $v$ and $w$ carefully so that the vector $Z'$ is negligible while $Z$ is large. We partition
$[R]$ into $V, W$ with $|V| = k_A - 1$ and $|W| \le k_B - 1$, so that $\pi_A(t) \in V$ and $\pi_B(s) \in W$ and
for each $r \in [R] - \{s, t\}$, either $\pi_A(r) \in V$ or $\pi_B(r) \in W$. Such a partitioning is possible since
$R \le k_A + k_B - 2$.

Let $\mathcal{V} = \mathrm{span}(A_V)$ and $\mathcal{W} = \mathrm{span}(B_W)$. We know that $r^* = \pi_A(s) \notin V$ and $r^* = \pi_B(t) \notin W$.
Hence, pick $v$ as unit vector along $\Pi^{\perp}_{\mathcal{V}} A_{r^*}$ and $w$ as unit vector along $\Pi^{\perp}_{\mathcal{W}} B_{r^*}$. By this choice, we
ensure that $Z' = 0$ (since $v \perp \mathcal{V}$ and $w \perp \mathcal{W}$).

However, $\text{K-rank}_{\tau_A}(A) \ge k_A$ and $\text{K-rank}_{\tau_B}(B) \ge k_B$, so $\langle v, A_{r^*} \rangle \langle w, B_{r^*} \rangle \ge 1/(\tau_A \tau_B)$ (by
Lemma A.2). Further, $|V| = k_A - 1$ implies that at most $R - k_A + 1 \le k_C - 1$ terms of $Z$ are
non-zero. Thus

\[ \Big\| \sum_{r \in [R] \setminus V} \beta_r C_r \Big\| \le \varepsilon_3, \quad \text{where } \beta_r = \langle v, A_r \rangle \langle w, B_r \rangle. \]

Further, $|\beta_{r^*}| \ge (\tau_A \tau_B)^{-1}$, and since $\text{K-rank}_{\tau_C}(C) = k_C \ge R - |V| + 1$, we have a contradiction
if $\varepsilon_3 < (\tau_A \tau_B \tau_C)^{-1}$ due to Lemma A.2. This will be true for our choice of parameters. Hence
$\Pi_A = \Pi_B$, and similarly $\Pi_B = \Pi_C$. Let us denote $\Pi = \Pi_A = \Pi_B = \Pi_C$. In the remainder, we
assume $\Pi$ is the identity, since this is without loss of generality.

To show $\Lambda_A \Lambda_B \Lambda_C =_{\varepsilon'} I_R$:

Let us denote $\beta_i = \Lambda_A(i) \Lambda_B(i) \Lambda_C(i)$. From (14) and triangle inequality, we have as before



^ A'^ ® B'^ ® C'^ - Y^ AA(r)AB(r)Ac(r) • A^^(^) ® B^^^^,) C^^^^) 

re[R\ re[-R] 

Combining this with the fact that the decompositions are £-close we get 



< be2WRpAP'BP'c 



^ (1 - Pr)Ar 0Br<S) C,- 

re[R] 



< £4 = e + h^fRpAPBPc^2 < ^^^p'ap'bp'c^2- 



By taking linear combinations given by unit vectors $x, y$ along the first two dimensions (i.e. $x^T A$
and $y^T B$) we have

\[ \Big\| \sum_{r \in [R]} (1 - \beta_r)\, (x^T A_r)(y^T B_r)\, C_r \Big\| < \varepsilon_4. \]



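We will show each $1 - \beta_r$ is negligible. Since $R + 2 \le k_A + k_B$, let $S, W \subseteq [R] - \{r\}$ be disjoint sets
of indices not containing $r$, which together cover $[R] - \{r\}$, such that $|S| = k_A - 1$ and $|W| \le k_B - 1$. Let $\mathcal{S} = \mathrm{span}(\{A_j : j \in S\})$
and $\mathcal{W} = \mathrm{span}(\{B_j : j \in W\})$. Let $x$ and $y$ be unit vectors along $\Pi^{\perp}_{\mathcal{S}} A_r$ and $\Pi^{\perp}_{\mathcal{W}} B_r$ respectively.

Since $\text{K-rank}_{\tau_A}(A) \ge k_A$ and $\text{K-rank}_{\tau_B}(B) \ge k_B$, we have that $\|\Pi^{\perp}_{\mathcal{S}} A_r\| \ge 1/\tau_A$ (similarly for
$B_r$). Hence, from Lemma A.2

\[ |1 - \beta_r| \Big( \frac{1}{\tau_A \tau_B} \Big) \|C_r\| < \varepsilon_4 \implies |1 - \beta_r| < \varepsilon_4\, \tau_A \tau_B \tau_C. \]

Thus, $\|\Lambda_A \Lambda_B \Lambda_C - I\| \le \varepsilon_4\, \tau_A \tau_B \tau_C \le \varepsilon'$ (our choice of $\varepsilon$ will ensure this). This implies the
theorem.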

Let us now set the e for the above to hold (note that -i^xi] involves a r term which depends on 

%i]) 

e' 
e :- 



Q{Rtatbtc)p'aP'bP'c ■ %2tel' 
which can easily be seen to be of the form in the statement of the theorem. This completes the
proof. $\square$

3.3 A Robust Permutation Lemma 



Let us now prove the robust version of the permutation lemma (Lemma 3.1). Recall that $\text{K-rank}_\tau(X)$
and $\text{K-rank}_\tau(Y)$ are $\ge k$, and that the matrices $X, Y$ are $n \times R$.

Kruskal's proof of the permutation lemma proceeds by induction. Roughly, he considers the 
span of some set of i columns of X (for i < k), and proves that there exist at least i columns of 
Y which lie in this span. The hypothesis of his lemma implies this for i = k — 1, and the proof 
proceeds by downward induction. Note that i = 1 implies for every column of X , there is at least 
one column of Y in its span. Since no two columns of X are parallel, and the number of columns 
is equal in X, Y, there must be precisely one column, and this completes the proof. 

A natural way to mimic this proof is to say: for each set of $i$ columns in $X$, there exists a set of
at least $i$ columns in $Y$ which are $\varepsilon_i$-close to the span of the chosen columns in $X$. The difficulty
with this is that we lose a factor of $\tau n$ in each iteration, i.e., if the statement was true for $i + 1$ with
error $\varepsilon_{i+1}$, it will be true for $i$ with error $\varepsilon_i = \tau n \cdot \varepsilon_{i+1}$. This means that to obtain a small error
at the end, we should have started off with error $< 1/(\tau n)^k$, which is exponentially small. Thus
we need a more tricky inductive statement and additional observations (including Lemma 3.4) to
overcome this issue.

We start by introducing some notation. If $V$ is a matrix and $S$ a subset of the columns, we
denote by $\mathrm{span}(V_S)$ the span of the columns of $V$ indexed by $S$. The next two lemmas are crucial
to the analysis.

Lemma 3.5. Let $X$ be a matrix as above. Let $A, B \subseteq [R]$, with $|B| = q$ and $A \cap B = \emptyset$. For
$1 \le i \le q$, define $T_i$ to be the union of $A$ with all elements of $B$ except the $i$th one (when indexed
in some way). Suppose further that $|A| + |B| \le k$. Then if $y \in \mathbb{R}^n$ is $\varepsilon$-close to $\mathrm{span}(X_{T_i})$ for each
$i$, it is in fact $\vartheta_{3.5} \cdot \varepsilon$ close to $\mathrm{span}(X_A)$, where $\vartheta_{3.5} := 4n\tau\rho$.

Proof. W.l.o.g., let us suppose $B = \{1, \dots, q\}$. Also, let $x_j$ denote the $j$th column of $X$. From the
hypothesis, we can write:

\[ y = u_1 + \sum_{j \in B,\, j \ne 1} \alpha_{1j} x_j + z_1 \]
\[ y = u_2 + \sum_{j \in B,\, j \ne 2} \alpha_{2j} x_j + z_2 \]
\[ \vdots \]
\[ y = u_q + \sum_{j \in B,\, j \ne q} \alpha_{qj} x_j + z_q, \]

where $u_i \in \mathrm{span}(X_A)$ and $z_i$ are the error vectors, which by hypothesis satisfy $\|z_i\|_2 \le \varepsilon$. We will
use the fact that $|A| + |B| \le k$ to conclude that each $\alpha_{ij}$ is tiny. This then implies the desired
conclusion.

By equating the first and $i$th equations ($i \ge 2$), we obtain

\[ u_1 + \sum_{j \ne 1} \alpha_{1j} x_j + z_1 = u_i + \sum_{j \ne i} \alpha_{ij} x_j + z_i. \]

Thus we have a combination of the vectors $x_j$ being equal to $z_i - z_1$, which by hypothesis is small:
$\|z_i - z_1\|_2 \le 2\varepsilon$. Now the key is to observe that the coefficient of $x_i$ is precisely $\alpha_{1i}$, because it is
zero in the $i$th equation. Thus by Lemma A.1 (since $\text{K-rank}_\tau(X) \ge k$), we have that $|\alpha_{1i}| \le 2\tau\varepsilon$.

Since we have this for all $i$, we can use the first equation to conclude that

\[ \|y - u_1\|_2 \le \sum_{j \ne 1} |\alpha_{1j}|\, \|x_j\|_2 + \|z_1\|_2 \le 2q\tau\rho\varepsilon + \varepsilon < 4n\tau\rho\varepsilon. \]

The last inequality is because $q < n$, and this completes the proof. $\square$

A counting argument lies at the core of the inductive proof. We present it in terms of sunflower 
set systems, since it allows for a clean presentation. 

Definition 3.6 (Sunflower set system). A set system $\mathcal{F}$ is said to be a "sunflower on $[R]$ with core
$T^*$" if $\mathcal{F} \subseteq 2^{[R]}$, and for any $F_1, F_2 \in \mathcal{F}$, we have $F_1 \cap F_2 \subseteq T^*$.

Lemma 3.7. Let $\{T_1, T_2, \dots, T_q\}$, $q \ge 2$, be a sunflower on $[R]$ with core $T^*$, and suppose $|T_1| +
|T_2| + \dots + |T_q| \ge R + (q - 1)\theta$, for some $\theta$. Then we have $|T^*| \ge \theta$, and furthermore, equality
occurs iff $T^* \subseteq T_i$ for all $1 \le i \le q$.

Proof. The proof is by a counting argument. By the sunflower structure, each $T_i$ has some
intersection with $T^*$, and some elements which do not belong to $T_{i'}$ for any $i' \ne i$. Call the number of
elements of the latter kind $t_i$. Then we must have

\[ R + (q - 1)\theta \le \sum_i |T_i| = \sum_i \big( t_i + |T_i \cap T^*| \big) \le \sum_i t_i + q |T^*|. \]

Now since all $T_i \subseteq [R]$, we have

\[ \sum_i t_i + |T^*| \le R. \]

Combining the two, we obtain

\[ R + (q - 1)\theta \le R + (q - 1)|T^*| \implies |T^*| \ge \theta, \]

as desired. For equality to occur, we must have equality in each of the places above, in particular,
we must have $|T_i \cap T^*| = |T^*|$ for all $i$, which implies $T^* \subseteq T_i$ for all $i$. $\square$
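
The counting bound of Lemma 3.7 is easy to check on small set systems. The sketch below takes the core to be the union of all pairwise intersections (the smallest set that qualifies as a core under Definition 3.6) and compares its size with the largest integer $\theta$ allowed by the hypothesis. The function names are hypothetical, and this is a sanity check rather than a substitute for the proof.

```python
from itertools import combinations


def smallest_core(sets):
    """Union of all pairwise intersections: the smallest valid sunflower core."""
    core = set()
    for F1, F2 in combinations(sets, 2):
        core |= F1 & F2
    return core


def largest_theta(sets, R):
    """Largest integer theta with sum_i |T_i| >= R + (q - 1) * theta."""
    q = len(sets)
    return (sum(len(T) for T in sets) - R) // (q - 1)


# Example with strict inequality: R = 6, three petals sharing {1, 2}.
family_a = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}]
core_a = smallest_core(family_a)          # {1, 2}
theta_a = largest_theta(family_a, R=6)    # (9 - 6) // 2

# Example with equality: |T*| equals theta and T* is contained in every T_i.
family_b = [{1, 2, 3, 4}, {1, 2, 5, 6}]
core_b = smallest_core(family_b)          # {1, 2}
theta_b = largest_theta(family_b, R=6)    # (8 - 6) // 1
```

In the second family the sets cover all of $[R]$ and the core sits inside every set, which is the situation used in the 'furthermore' part of the lemma.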

Finally, we introduce a bit more notation before getting to the proof. For $S \subseteq [R]$ of size $(k - 1)$,
we define $T_S$ to be the set of indices corresponding to columns of $Y$ which are $\varepsilon_1$-close to $\mathrm{span}(X_S)$,
where $\varepsilon_1 := (nR)\varepsilon$, and $\varepsilon$ is as defined in the statement of Lemma 3.1. For smaller sets $S$, we
define:

\[ T_S := \bigcap_{|S'| = (k - 1),\ S' \supseteq S} T_{S'}. \]



With the above lemmas in place, we can prove Lemma 3.1.

Proof of Lemma 3.1. We first prove the following claim by induction:



Claim. For every $S \subseteq [R]$ of size $\le (k - 1)$, we have $|T_S| = |S|$.

We do this by downward induction on $|S|$. For $|S| = k - 1$, the hypothesis of the lemma implies
that $|T_S| \ge k - 1$. To see this, let $V$ be the $(n - k + 1)$ dimensional space orthogonal to the span
of $X_S$, and let $t$ be the number of columns of $Y$ which have a projection $\ge \varepsilon_1$ onto $V$. From
Lemma A.3 (applied to the projections to $V$), there is a unit vector $w \in V$ with dot-product
of magnitude $\ge \varepsilon_1/(Rn) = \varepsilon$ with each of the $t$ columns. From the hypothesis, since $w \in V$
($\implies \mathrm{nz}(w^T X) \le R - k + 1$), we have $t \le R - k + 1$. Thus at least $(k - 1)$ of the columns are
$\varepsilon_1$-close to $\mathrm{span}(X_S)$. Now since $\text{K-rank}_\tau(Y) \ge k$, it follows that $k$ columns of $Y$ cannot be $\varepsilon_1$-close
to the $(k - 1)$-dimensional space $\mathrm{span}(X_S)$ (Lemma A.2). Thus $|T_S| = k - 1$.



Now consider some $S$ of size $|S| \le k - 2$. W.l.o.g., we may suppose it is $\{R - |S| + 1, \dots, R\}$. Let
$W_i$ denote $T_{S \cup \{i\}}$, for $1 \le i \le R - |S|$, and let us write $q = R - |S|$. By the inductive hypothesis,
$|W_i| \ge |S| + 1$ for all $i$.

Let us define $T^*$ to be the set of indices of the columns of $Y$ which are $\varepsilon_1 \cdot \vartheta_{3.5}$-close to $\mathrm{span}(X_S)$.
We claim that $W_i \cap W_j \subseteq T^*$ for any $i \ne j$. This can be seen as follows: first note that $W_i \cap W_j$
is contained in the intersection of $T_{S'}$, where the intersection is over $S'$ such that $|S'| = k - 1$, and
$S'$ contains $S$ and either $i$ or $j$. Now consider any $(k - |S|)$-element set $B$ which contains both $i, j$ (note
$|S| \le k - 2$). The intersection above includes sets which contain $S$ along with all of $B$ except the
$r$th element (indexed arbitrarily), for each $r$. Thus by Lemma 3.5, we have that $W_i \cap W_j \subseteq T^*$.
Thus the sets $\{W_1, \dots, W_q\}$ form a sunflower family with core $T^*$. Further, we can check that
the condition of Lemma 3.7 holds with $\theta = |S|$: since $|W_i| \ge |S| + 1$ by the inductive hypothesis,
it suffices to verify that

\[ R + (q - 1)|S| \le q(|S| + 1), \quad \text{which is true since } R = q + |S|. \]

Thus we must have $|T^*| \ge |S|$.

But now, note that $T^*$ is defined as the columns of $Y$ which are $\varepsilon_1 \cdot \vartheta_{3.5}$-close to $\mathrm{span}(X_S)$, and
thus $|T^*| \le |S|$ (by Lemma A.2), and thus we have $|T^*| = |S|$. Now we have equality in Lemma 3.7
and so the 'furthermore' part of the lemma implies that $T^* \subseteq W_i$ for all $i$.

Thus we have $T_S = \bigcap_i W_i = T^*$ (the first equality follows from the definition of $T_S$), thus
completing the proof of the claim, by induction.

Once we have the claim, the lemma follows by applying it to singleton sets. Let $S = \{i\}$. Now
if $y$ is a column of $Y$ which is $\varepsilon_1$-close to $\mathrm{span}(X_{S'})$ for all $(k - 1)$-element subsets $S'$ (of $[R]$) which contain
$i$, by Lemma 3.5 we have $y$ being $\varepsilon_1 \cdot \vartheta_{3.5}$-close to $\mathrm{span}(X_{\{i\}})$, which implies $\|y - \alpha x_i\|_2 < \varepsilon_1 \cdot \vartheta_{3.5}$ for some scalar $\alpha$.
Since this is true for each column $i$, and since $k \ge 2$, the lemma follows. $\square$

3.4 Uniqueness Theorem for Higher Order Tensors 

We show the uniqueness theorem for higher order tensors by a reduction to third order tensors as in [SB00]. This reduction will proceed inductively, i.e., the robust uniqueness of order $\ell$ tensors is deduced from that of order $(\ell-1)$ tensors. We will convert an order $\ell$ tensor to an order $(\ell-1)$ tensor by combining two of the components together (say the last two) as an $n_{\ell-1} n_\ell$ dimensional vector ($u^{(\ell-1)} \otimes u^{(\ell)}$, say). This is precisely captured by the Khatri-Rao product of two matrices:

Definition 3.8 (Khatri-Rao product). Given two matrices $A$ (size $n_1 \times R$) and $B$ (size $n_2 \times R$), the $(n_1 n_2) \times R$ matrix $M = A \odot B$ constructed with the $i$th column equal to $M_i = A_i \otimes B_i$ (viewed as a vector) is the Khatri-Rao product.
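To make Definition 3.8 concrete, the following is a minimal NumPy sketch of the Khatri-Rao product. The function name `khatri_rao` and the small example matrices are our own illustrative choices and do not appear in the text; the flattening order is chosen so that column $i$ agrees with `np.kron(A[:, i], B[:, i])`.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column i equals kron(A[:, i], B[:, i])."""
    n1, R = A.shape
    n2, R_b = B.shape
    if R != R_b:
        raise ValueError("A and B must have the same number of columns")
    # Broadcast to shape (n1, n2, R), then flatten the first two axes (row-major),
    # which matches the index convention of np.kron for vectors.
    return (A[:, None, :] * B[None, :, :]).reshape(n1 * n2, R)

# Hypothetical example: A is 2 x 2 and B is 3 x 2, so the product is 6 x 2.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 5.0]])
M = khatri_rao(A, B)
```

Each column of `M` is the vectorized rank-one matrix $A_i B_i^T$, which is how two modes of a tensor are merged into one in the reduction above.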



Lemma A.4 in the appendix relates the K-rank of $A \odot B$ with $k_A = $ K-rank$_{\tau_1}(A)$ and $k_B = $ K-rank$_{\tau_2}(B)$. It shows that K-rank$_{\tau_1 \tau_2 \vartheta}(A \odot B) \ge \min\{k_A + k_B - 1, R\}$, for some $\vartheta$. This turns out to be crucial to the proof of uniqueness in the general case, which we present now.

Outline. The proof proceeds by induction on $\ell$. The base case is $\ell = 3$, and for higher $\ell$, the idea is to reduce to the case of $\ell - 1$ by taking the Khatri-Rao product of the vectors in two of the dimensions. That is, if $[U^{(1)}\; U^{(2)} \dots U^{(\ell)}]$ and $[V^{(1)}\; V^{(2)} \dots V^{(\ell)}]$ are close, we conclude that $[U^{(1)}\; U^{(2)} \dots (U^{(\ell-1)} \odot U^{(\ell)})]$ and $[V^{(1)}\; V^{(2)} \dots (V^{(\ell-1)} \odot V^{(\ell)})]$ are close, and use the inductive hypothesis, which holds because of Lemma A.4 mentioned above. We then need an additional step to conclude that if $A \odot B$ and $C \odot D$ are close, then so are $A, C$ and $B, D$ up to some loss (Lemma A.5; this is where we have a square root loss, which is why we have a bad dependence on the $\epsilon'$ in the statement). We now formalize this outline.



Proof of Theorem 2.7. We will prove by induction on $\ell$. The base case of $\ell = 3$ is established by Theorem 2.6. Thus consider some $\ell \ge 4$, and suppose the theorem is true for $\ell - 1$. Furthermore, suppose the parameters $\epsilon$ and $\epsilon'$ in the statement of Theorem 2.7 for $(\ell - 1)$ are $\epsilon_{\ell-1}$ and $\epsilon'_{\ell-1}$. We will use these to define $\epsilon_\ell$ and $\epsilon'_\ell$, which correspond to the parameters in the statement for $\ell$.

Now consider $U^{(j)}$ and $V^{(j)}$ as in the statement of the theorem. Let us assume without loss of generality that $k_1 \ge k_2 \ge \dots \ge k_\ell$. Also let $K = \sum_{j \in [\ell]} k_j$. We will now combine the last two components $(\ell - 1)$ and $\ell$ by the Khatri-Rao product:

$\tilde U = U^{(\ell-1)} \odot U^{(\ell)}$ and $\tilde V = V^{(\ell-1)} \odot V^{(\ell)}$.

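Since we know that the two representations are close in Frobenius norm, we have

$$\Big\| \sum_{r \in [R]} U_r^{(1)} \otimes \dots \otimes U_r^{(\ell-2)} \otimes \tilde U_r - \sum_{r \in [R]} V_r^{(1)} \otimes \dots \otimes V_r^{(\ell-2)} \otimes \tilde V_r \Big\|_F < \epsilon_\ell \qquad (15)$$

Let us first check that the conditions for $(\ell-1)$-order tensors hold for $\tilde\tau = (\tau_{\ell-1} \tau_\ell \sqrt{K}) \le (\tau_{\ell-1} \tau_\ell \sqrt{3R})$. From Lemma A.4, K-rank$_{\tilde\tau}(\tilde U) \ge \min\{k_\ell + k_{\ell-1} - 1, R\}$. Below, $k'_j$ denotes the K-ranks of the factors of the $(\ell-1)$-order tensor.

Suppose first that $k_\ell + k_{\ell-1} \le R + 1$, then

$$\sum_{j \in [\ell-1]} k'_j \ge \sum_{j \in [\ell-2]} k_j + k_{\ell-1} + k_\ell - 1 \ge 2R + (\ell-1) - 1.$$

Otherwise, if $k_\ell + k_{\ell-1} > R + 1$, then $k_{\ell-3} + k_{\ell-2} \ge R + 2$ (due to our ordering, and $\ell \ge 4$). Hence

$$\sum_{j \in [\ell-1]} k'_j \ge (\ell - 4) + (R + 2) + (R + 1) \ge 2R + \ell - 1.$$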

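We now apply the inductive hypothesis on this $(\ell-1)$th order tensor. Note that $\tilde\rho \le (\rho_{\ell-1}\rho_\ell)$, $\tilde\rho' \le (\rho'_{\ell-1}\rho'_\ell)$, $\tilde\tau \le (2\tau_{\ell-1}\tau_\ell\sqrt{R})$ and $\tilde n = n_{\ell-1} n_\ell$.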

We will in fact apply it with e^_]^ < min{(i? • r^-ir^ • p'i_\p'()~'^ ■, {e'^)'^ /R}, so that we can later 



use Lemma I A. 51 To ensure these, we will set 



e;^ = ^ ( ;7 ) • I n '^^^J ' Pj ' P'j ' '^i) 1 %2l('^' P^ P' ^ 



e 



n 



ue[^-2] 



where 'd^m = x '^ •* . From the values of f , p, h above, this can easily be seen to be of the form in 
the statement of the theorem. 

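The inductive hypothesis implies that there is a permutation matrix $\Pi$ and scalar matrices $\{\Lambda^{(1)}, \Lambda^{(2)}, \dots, \Lambda^{(\ell-2)}, \Lambda'\}$, such that $\|\Lambda^{(1)}\Lambda^{(2)} \cdots \Lambda^{(\ell-2)}\Lambda' - I\| < \epsilon'_{\ell-1}$ and

$$\forall j \in [\ell-2]: \quad \|V^{(j)} - U^{(j)} \Pi \Lambda^{(j)}\|_F < \epsilon'_{\ell-1}, \qquad \|\tilde V - \tilde U \Pi \Lambda'\|_F < \epsilon'_{\ell-1}.$$

Since $\epsilon'_{\ell-1} < \epsilon'_\ell$, equation (5) is satisfied for $j \in [\ell-2]$. We thus need to show that $\|V^{(j)} - U^{(j)} \Pi \Lambda^{(j)}\|_F < \epsilon'_\ell$ for $j = \ell-1$ and $\ell$. To do this, we appeal to Lemma A.5, to say that if the Frobenius norm of the difference of two tensor products $u \otimes v$ and $u' \otimes v'$ is small, then the component vectors are nearly parallel.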

Let us first set the parameters for applying Lemma A.5[ Each column vector is of length at 
most Lmax < p' < (p^__iP^) and length at least L^i^ > 1/f > {2ti^iTi\/R) . Hence, because of our 

choice of e^_^ <^ ( 4\/^(T^-i'rf)(p£_iP£) ) earlier, the conditions of Lemma 



5 < e'o. Let 5r 



Vr - ^.(.)A'(r) 



A.5 



are satisfied with 



Now applying Lemma A.5 with $\delta = \delta_r$, to column $r$, we see that there are scalars $\alpha_r(\ell-1)$ and $\alpha_r(\ell)$ such that



\l - ar{l - l)ar{e)\ < 



t2 

inin 



<4 
By setting, for all $r \in [R]$, $\Lambda^{(\ell-1)}(r) = \alpha_r(\ell-1)$ and $\Lambda^{(\ell)}(r) = \alpha_r(\ell)\Lambda'(r)$, we see that the first part of (5) is satisfied. Finally, Lemma A.5 shows that



yj €{£-!, £} V}^^ - ulf.A^^^ (r) < 7^ , Vr € [R] 



T(rr 

y(i) _ ;70')nA(^') 



This completes the proof of the theorem. 



< 



i?V4 



by Cauchy-Schwartz inequality). 



< eV 



D 



We show a similar result for symmetric tensors, which shows robust uniqueness up to permutations (and no scaling). This will be useful in applications to mixture models (Section 5).

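Corollary 3.9 (Unique Symmetric Decompositions). For every $0 < \eta < 1$, $\tau, \rho, \rho' > 0$ and $\ell, R \in \mathbb{N}$, $\exists\, \epsilon = \vartheta_{3.9}(\eta, R, n, \tau, \rho, \rho')$ such that, for any $\ell$-order symmetric tensor (with $\ell < R$)

$$T = \sum_{r \in [R]} \bigotimes_{j=1}^{\ell} U_r$$

where the matrix $U$ is $\rho$-bounded with K-rank$_\tau(U) = k \ge \frac{2R-1}{\ell} + 1$, and for any other $\rho'$-bounded, symmetric, rank-$R$ decomposition of $T$ which is $\epsilon$-close, i.e.,

$$\Big\| \sum_{r \in [R]} \bigotimes_{j=1}^{\ell} V_r - \sum_{r \in [R]} \bigotimes_{j=1}^{\ell} U_r \Big\|_F \le \epsilon,$$

there exists an $R \times R$ permutation matrix $\Pi$ such that

$$\|V - U\Pi\|_F \le \eta \qquad (16)$$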



The mild intricacy here is that applying Theorem 2.7 gives a collection of scalar matrices whose product is close to the identity, while we want each of the matrices to be so. This turns out to be easy to argue; see Section A.1.



4 Computing Tensor Decompositions 

For matrices, the theory of low rank approximation is well understood, and such approximations are captured using singular values. In contrast, the tensor analog of the problem is in general ill-posed: for instance, there exist rank-3 tensors with arbitrarily good rank-2 approximations [Lan12]. For example, if $u, v$ are orthogonal vectors, we have

$$u \otimes v \otimes v + v \otimes u \otimes v + v \otimes v \otimes u = \frac{1}{\epsilon}\big[(v + \epsilon u) \otimes (v + \epsilon u) \otimes (v + \epsilon u) - v \otimes v \otimes v\big] + \mathcal{N},$$

where $\|\mathcal{N}\|_F \le O(\epsilon)$, while it is known that the LHS has rank 3. However, note that the rank-2 representation with error $\epsilon$ uses vectors of length $1/\epsilon$, and such cancellations are, in a sense, responsible for the ill-posedness.
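The identity above can be checked numerically. The sketch below is only an illustration under our own assumptions: $u, v$ are taken to be the standard basis vectors of $\mathbb{R}^2$, and the helper names `outer3` and `rank2_error` are hypothetical. Expanding the cube shows that the residual $\mathcal{N}$ has Frobenius norm $\epsilon\sqrt{3 + \epsilon^2}$ for unit orthogonal $u, v$, so the error shrinks linearly in $\epsilon$ while the factor $1/\epsilon$ blows up.

```python
import numpy as np

def outer3(a, b, c):
    """Rank-one third order tensor a (x) b (x) c."""
    return np.einsum("i,j,k->ijk", a, b, c)

# Assumed example: orthonormal u, v in R^2.
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
lhs = outer3(u, v, v) + outer3(v, u, v) + outer3(v, v, u)

def rank2_error(eps):
    """Frobenius distance between the rank-3 tensor and its rank-2 surrogate."""
    w = v + eps * u
    approx = (outer3(w, w, w) - outer3(v, v, v)) / eps
    return np.linalg.norm(lhs - approx)

errors = {eps: rank2_error(eps) for eps in (1e-1, 1e-2, 1e-3)}
```

The approximation error tends to zero only because the two rank-one terms, each of norm about $1/\epsilon$, nearly cancel; bounding the norms of the factors rules this out.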

Hence in order to make the problem well-posed, we will impose a boundedness assumption.

Definition 4.1 ($\rho$-bounded Low-rank Approximation). Suppose we are given a parameter $R$ and an $m \times n \times p$ tensor $T$ which can be written as

$$T = \sum_{i=1}^{R} a_i \otimes b_i \otimes c_i + \mathcal{N}, \qquad (17)$$

where $a_i \in \mathbb{R}^m$, $b_i \in \mathbb{R}^n$, $c_i \in \mathbb{R}^p$ satisfy $\max\{\|a_i\|_2, \|b_i\|_2, \|c_i\|_2\} \le \rho$, and $\mathcal{N}$ is a noise tensor which satisfies $\|\mathcal{N}\|_F \le \epsilon$, for some small enough $\epsilon$. The $\rho$-bounded low-rank decomposition problem asks to recover a good low rank approximation, i.e.,

$$T = \sum_{i=1}^{R} a'_i \otimes b'_i \otimes c'_i + \mathcal{N}'$$

such that $a'_i, b'_i, c'_i$ are vectors with norm at most $\rho$, and $\|\mathcal{N}'\|_F \le O(1) \cdot \epsilon$.



We note that if the decomposition into $[A\; B\; C]$ above satisfies the conditions of Theorem 2.6, then solving the $\rho$-bounded low-rank approximation problem would allow us to recover $A, B, C$ up to a small error. The algorithmic result we prove is the following (restated version of Theorem 2.8).

Theorem 4.2. The $\rho$-bounded low-rank approximation problem can be solved in time $\mathrm{poly}(n) \cdot \exp(R^2 \log(R\rho/\epsilon))$.

In fact, the $O(1)$ term in the error bound $\|\mathcal{N}'\|_F \le O(1) \cdot \epsilon$ will just be 5. Our algorithm is extremely simple conceptually: we identify three $R$-dimensional spaces by computing appropriate SVDs, and prove that for the purpose of obtaining an approximation with $O(\epsilon)$ error, it suffices to look for $a_i, b_i, c_i$ in these spaces. We then find the approximate decomposition by a brute force search using an epsilon-net. Note that the algorithm has a polynomial running time for constant $R$, which is typically when the low rank approximation problem is interesting.

Proof. In what follows, let $M_A$ denote the $m \times np$ matrix whose columns are the so-called $j,k$th modes of the tensor $T$, i.e., the $m$ dimensional vector of $T_{ijk}$ values obtained by fixing $j, k$ and varying $i$. Similarly, we define $M_B$ ($n \times mp$) and $M_C$ ($p \times mn$). Also, we denote by $A$ the $m \times R$ matrix with columns being $a_i$. Similarly define $B$ ($n \times R$), $C$ ($p \times R$).
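The mode matrices can be formed by reshaping. The sketch below is one possible way to build them in NumPy; the function name `unfold_modes`, the column ordering, and the random test tensor are our own assumptions (any fixed ordering of the columns works for the argument, since only the column space matters).

```python
import numpy as np

def unfold_modes(T):
    """Return (M_A, M_B, M_C) for an m x n x p tensor T.

    Column j * p + k of M_A is the vector T[:, j, k]; M_B and M_C are built
    analogously after moving the second or third axis to the front.
    """
    m, n, p = T.shape
    M_A = T.reshape(m, n * p)
    M_B = np.transpose(T, (1, 0, 2)).reshape(n, m * p)
    M_C = np.transpose(T, (2, 0, 1)).reshape(p, m * n)
    return M_A, M_B, M_C

# Hypothetical noiseless example with R = 2 rank-one terms.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
T = np.einsum("ir,jr,kr->ijk", A, B, C)
M_A, M_B, M_C = unfold_modes(T)
```

In the noiseless case every column of `M_A` lies in the span of the columns of `A`, so each unfolding has rank at most $R$.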

The outline of the proof is as follows: we first observe that the matrices $M_A, M_B, M_C$ are all approximately rank $R$. We then let $V_A, V_B$ and $V_C$ be the span of the top $R$ singular vectors of $M_A, M_B$ and $M_C$ respectively, and show that it suffices to search for $a_i, b_i$, and $c_i$ in these spans. We note that we do not (and in fact cannot, as simple examples show) obtain the true span of the $a_i$, $b_i$ and $c_i$'s in general. Our proof carefully gets around this point. We then construct an $\epsilon$-net for $V_A, V_B, V_C$, and try out all possible $R$-tuples. This gives the roughly $\exp(R^2)$ running time claimed in the Theorem.

We now make formal claims following the outline above. 

Claim 4.3. Let $V_A$ be the span of the top $R$ singular vectors of $M_A$, and let $\Pi_A$ be the projection matrix onto $V_A$ (i.e., $\Pi_A v$ is the projection of $v \in \mathbb{R}^m$ onto $V_A$). Then we have

$$\|M_A - \Pi_A M_A\|_F \le \epsilon.$$

Proof. Because the top $R$ singular vectors give the best possible rank-$R$ approximation of a matrix for every $R$, for any $R$-dimensional subspace $S$, if $\Pi_S$ is the projection matrix onto $S$, we have

$$\|M_A - \Pi_A M_A\|_F \le \|M_A - \Pi_S M_A\|_F.$$

Picking $S$ to be the span of the vectors $\{a_1, \dots, a_R\}$, we obtain

$$\|M_A - \Pi_S M_A\|_F \le \|\mathcal{N}\|_F \le \epsilon.$$

The first inequality above is because the $j,k$th mode of the tensor $\sum_i a_i \otimes b_i \otimes c_i$ is a vector in the span of $\{a_1, \dots, a_R\}$; in particular, it is equal to $\sum_i b_i(j) c_i(k) a_i$, where $b_i(j)$ denotes the $j$th coordinate of $b_i$.

This completes the proof. □
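Claim 4.3 is easy to check numerically. The following sketch assumes a small random instance of our own choosing (dimensions, seed and noise scale are illustrative, not from the text): it builds a noisy rank-$R$ tensor, unfolds it along the first mode, projects onto the top $R$ left singular vectors, and compares the residual with the noise norm.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, R = 6, 5, 4, 2

# Noisy rank-R tensor T = sum_i a_i (x) b_i (x) c_i + N (assumed test instance).
A = rng.standard_normal((m, R))
B = rng.standard_normal((n, R))
C = rng.standard_normal((p, R))
noise = 1e-3 * rng.standard_normal((m, n, p))
T = np.einsum("ir,jr,kr->ijk", A, B, C) + noise
eps = np.linalg.norm(noise)  # Frobenius norm of the noise tensor

# Mode-A unfolding and projection onto the span of its top R left singular vectors.
M_A = T.reshape(m, n * p)
U, s, _ = np.linalg.svd(M_A, full_matrices=False)
Pi_A = U[:, :R] @ U[:, :R].T
residual = np.linalg.norm(M_A - Pi_A @ M_A)
```

The residual equals the norm of the discarded singular values, and by the best rank-$R$ approximation property it cannot exceed the noise norm `eps`.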

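Next, we will show that looking for $a_i, b_i, c_i$ in the spaces $V_A, V_B, V_C$ is sufficient. The natural choices are $\Pi_A a_i, \Pi_B b_i, \Pi_C c_i$, and we show that this choice in fact gives a good approximation. For convenience let $\tilde a_i := \Pi_A a_i$, and $a_i^{\perp} := a_i - \tilde a_i$ (and similarly for $b_i$, $c_i$).

Claim 4.4. For $T, V_A, \tilde a_i, \dots$ as defined above, we have

$$\Big\| T - \mathcal{N} - \sum_i \tilde a_i \otimes \tilde b_i \otimes \tilde c_i \Big\|_F \le 3\epsilon.$$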

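Proof. The proof is by a hybrid argument. We write

$$T - \mathcal{N} - \sum_i \tilde a_i \otimes \tilde b_i \otimes \tilde c_i = \Big(\sum_i a_i \otimes b_i \otimes c_i - \tilde a_i \otimes b_i \otimes c_i\Big) + \Big(\sum_i \tilde a_i \otimes b_i \otimes c_i - \tilde a_i \otimes \tilde b_i \otimes c_i\Big) + \Big(\sum_i \tilde a_i \otimes \tilde b_i \otimes c_i - \tilde a_i \otimes \tilde b_i \otimes \tilde c_i\Big).$$

We now bound each of the terms in the parentheses, and then appeal to the triangle inequality (for the Frobenius norm). Now, the first term is easy:

$$\Big\| \sum_i a_i \otimes b_i \otimes c_i - \tilde a_i \otimes b_i \otimes c_i \Big\|_F = \|M_A - \Pi_A M_A\|_F \le \epsilon.$$

One way to bound the second term is as follows. Note that:

$$\sum_i a_i \otimes b_i \otimes c_i - a_i \otimes \tilde b_i \otimes c_i = \Big(\sum_i \tilde a_i \otimes b_i \otimes c_i - \tilde a_i \otimes \tilde b_i \otimes c_i\Big) + \Big(\sum_i a_i^{\perp} \otimes b_i \otimes c_i - a_i^{\perp} \otimes \tilde b_i \otimes c_i\Big).$$

Now let us denote the two terms in the parentheses on the RHS by $G, H$; these are tensors which we view as $mnp$ dimensional vectors. We have $\|G + H\|_2 \le \epsilon$, because the Frobenius norm of the LHS is precisely $\|M_B - \Pi_B M_B\|_F \le \epsilon$. Furthermore, $\langle G, H \rangle = 0$, because $\langle \tilde a_i, a_j^{\perp} \rangle = 0$ for any $i, j$ (one vector lies in the span $V_A$ and the other is orthogonal to it). Thus we have $\|G\|_2 \le \epsilon$ (since in this case $\|G + H\|_2^2 = \|G\|_2^2 + \|H\|_2^2$).

A very similar proof lets us conclude that the Frobenius norm of the third term is also $\le \epsilon$. This completes the proof of the claim, by our earlier observation. □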
The claim above shows that there exist vectors $\tilde a_i, \tilde b_i, \tilde c_i$ of length at most $\rho$ in $V_A, V_B, V_C$ resp., which give a rank-$R$ approximation with error at most $4\epsilon$. Now, we form an $\epsilon/(R\rho^2)$-net over the ball of radius $\rho$ in each of the spaces $V_A, V_B, V_C$. Since these spaces have dimension $R$, the nets have size at most $\exp(O(R) \log(R\rho/\epsilon))$.

Thus let us try all possible candidates for $\hat a_i, \hat b_i, \hat c_i$ from these nets. Suppose we have $\hat a_i, \hat b_i, \hat c_i$ being vectors which are $\epsilon/(6R\rho^2)$-close to $\tilde a_i, \tilde b_i, \tilde c_i$ respectively; it is easy to see that

$$\Big\| \sum_i \tilde a_i \otimes \tilde b_i \otimes \tilde c_i - \hat a_i \otimes \hat b_i \otimes \hat c_i \Big\|_F \le \sum_i \big\| \tilde a_i \otimes \tilde b_i \otimes \tilde c_i - \hat a_i \otimes \hat b_i \otimes \hat c_i \big\|_F.$$

Now by a hybrid argument exactly as above, and using the fact that all the vectors involved are $\le \rho$ in length, we obtain that the LHS above is at most $\epsilon$.

Thus the algorithm finds vectors such that the error is at most $5\epsilon$. The running time depends on the time taken to try all possible candidates for $3R$ vectors, and evaluating the tensor for each. Thus it is $\mathrm{poly}(m, n, p) \cdot \exp(O(R^2) \log(R\rho/\epsilon))$. □
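The hybrid step used for the net can be illustrated on a single rank-one term: if $a, b, c$ and their net points $\hat a, \hat b, \hat c$ all have norm at most $\rho$ and each pair is $\delta$-close, then telescoping gives $\|a \otimes b \otimes c - \hat a \otimes \hat b \otimes \hat c\|_F \le 3\rho^2\delta$. The sketch below checks this bound on random instances; the helper names, the dimension, and the values of $\rho$ and $\delta$ are our own assumptions, and random perturbations stand in for actual net points.

```python
import numpy as np

def outer3(a, b, c):
    return np.einsum("i,j,k->ijk", a, b, c)

def point_in_ball(rng, dim, rho):
    """Random point of norm at most rho."""
    x = rng.standard_normal(dim)
    return rho * rng.uniform() * x / np.linalg.norm(x)

def nearby_point_in_ball(rng, x, delta, rho):
    """Point within distance delta of x that also lies in the ball of radius rho."""
    step = rng.standard_normal(x.shape)
    y = x + delta * rng.uniform() * step / np.linalg.norm(step)
    norm = np.linalg.norm(y)
    # Radial projection onto the ball never increases the distance to x,
    # because x itself already lies in the ball.
    return y if norm <= rho else y * (rho / norm)

rng = np.random.default_rng(1)
rho, delta, dim = 2.0, 1e-2, 5
worst_ratio = 0.0
for _ in range(200):
    a, b, c = (point_in_ball(rng, dim, rho) for _ in range(3))
    a2, b2, c2 = (nearby_point_in_ball(rng, v, delta, rho) for v in (a, b, c))
    gap = np.linalg.norm(outer3(a, b, c) - outer3(a2, b2, c2))
    worst_ratio = max(worst_ratio, gap / (3 * rho ** 2 * delta))
```

With $\delta = \epsilon/(6R\rho^2)$ each of the $R$ terms contributes at most $\epsilon/(2R)$, which is how the total rounding error stays below $\epsilon$.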

This argument generalizes in an obvious way to order $\ell$ tensors, and gives the following. We omit the proof.

Theorem 4.5. There is an algorithm that, when given an order $\ell$ tensor of size $n$ with a rank $R$ approximation of error $\epsilon$ (in $\|\cdot\|_F$), finds a rank-$R$ approximation of error $O(\ell\epsilon)$ in time $\mathrm{poly}(n) \cdot \exp(O(\ell R^2) \log(\ell R\rho/\epsilon))$.

5 Polynomial Identifiability of Latent Variable and Mixture Models

We now show how our robust uniqueness theorems for tensor decompositions can be used for 
learning latent variable models, with polynomial sample complexity bounds. 

Definition 5.1 (Polynomial Identifiability). An instance of a hidden variable model of size $m$ with hidden variables set $T$ is said to be polynomial identifiable if there is an algorithm that, given any $\eta > 0$, uses only $N \le \mathrm{poly}(m, 1/\eta)$ samples and finds with probability $1 - o(1)$ estimates of the hidden variables $T'$ such that $\|T' - T\| < \eta$.

Consider a simple mixture model, where each sample is generated from a mixture of $R$ distributions $\{\mathcal{D}_r\}_{r \in [R]}$, with mixing probabilities $\{w_r\}_{r \in [R]}$. Here the latent variable $h$ corresponds to the choice of distribution and it can have $[R]$ possibilities. First the distribution $h = r$ is picked with probability $w_r$, and then the data is sampled according to $\mathcal{D}_r$, which has mean $\mu_r \in \mathbb{R}^n$. Let $M_{n \times R}$ represent the matrix of these $R$ means. The goal is to learn these hidden parameters ($M$ and weights $\{w_r\}$) after observing many samples. This setting captures many latent variable models including topic models, HMMs, gaussian mixtures etc.

While practitioners typically use Expectation-Maximization (EM) methods to learn the parameters, a good alternative in the case of mixture models is the method of moments approach (starting from the work by Pearson [Pea94] for univariate gaussians), which tries to identify the parameters by estimating higher order moments. However, one drawback is that the number of moments required is typically as large as the number of mixtures $R$ (or parameters), resulting in a sample complexity that is exponential in $R$ [MV10, BS10, FOS05, FSO06].

In a recent exciting line of work [MR06, AHK12, HK12, AFH+12, AGH+12], it is shown that $\mathrm{poly}(R, n)$ samples suffice for identifiability in a special case called the non-singular or non-degenerate case, i.e. when the matrix $M$ has full rank (rank $= R$), for many of these models. Their algorithms for this case proceed by reducing the problem of finding the latent variables (the means and weights) to the problem of decomposing Symmetric Orthogonal Tensors of order 3, which are known to be solvable in $\mathrm{poly}(n, R)$ time using power-iteration type methods [KR01, ZG01, AGH+12].

However, their approach crucially relies on these non-degeneracy conditions, and is not robust: even in the case when these $R$ means reside in an $(R-1)$-dimensional space, these algorithms fail, and the best known sample complexity bounds in many of these settings are $\exp(R)\,\mathrm{poly}(n)$. In many settings like speech recognition and image classification, the dimension of the feature space $n$ is typically much smaller than $R$, the number of topics or clusters. For instance, the (effective) feature space corresponds to just the low-frequency components in the Fourier spectrum for speech, or the local neighborhood of a pixel in images (SIFT features [Low99]). These are typically much smaller than the number of different kinds of objects or patterns (topics) that are possible. Further, in other settings, the set of relevant features (the effective feature space) could be a space of much smaller dimension ($k < R$) that is unknown to us even when the feature vectors are actually represented in a large dimensional space ($n \gg R$).

In this section, we show that we can use our Robust Uniqueness results for Tensor Decompositions (Theorem 2.6 and Theorem 2.7) to go past the non-degeneracy barrier and prove that $\mathrm{poly}(R, n)$ samples suffice even under the milder condition that no $k = \delta R$ gaussians lie in a $(k-1)$ dimensional space (for some constant $\delta > 0$). Further, these results generalize to other hidden variable models like Topic Modeling, Hidden Markov models, Mixture models etc. One interesting aspect of our approach is that, unlike previous works, we get a smooth tradeoff: we get polynomial identifiability under successively milder conditions by using higher order tensors ($\ell \approx 2/\delta$). This reinforces the intuition that higher moments capture more information at the cost of efficiency.

In the rest of this section, we will first describe Multi-view models and show how the robust 
uniqueness theorems for tensor decompositions imply polynomial identifiability in this model. We 
will then see two popular latent variable models which fit into the multi-view mixture model: the 
exchangeable (single) Topic Model and Hidden Markov models. We note that the results of this section (for $\ell = 3$ views) also apply to other latent variable models like Latent Dirichlet Allocation (LDA) and Independent Component Analysis (ICA) that were studied in [AGH+12]. We omit the details in this version of the paper.

5.1 Multi-view Mixture Model 

Multi-view models are mixture models with a latent variable $h$, where we are given multiple observations or views $x^{(1)}, x^{(2)}, \dots, x^{(\ell)}$ that are conditionally independent given the latent variable $h$. Multi-view models are very expressive, and capture many well-studied models like Topic Models [AHK12], Hidden Markov Models (HMMs) [MR06, AMR09, AHK12], and random graph mixtures [AMR09]. We first introduce some notation, along the lines of [AMR09, AHK12].

Footnote: For polynomial identifiability, $\sigma_R \ge 1/\mathrm{poly}(n)$.

Definition 5.2 (Multi-view mixture models).

The latent variable $h$ is a discrete random variable having domain $[R]$, so that $\Pr[h = r] = w_r$, $\forall r \in [R]$.

The views $\{x^{(j)}\}_{j \in [\ell]}$ are random vectors $\in \mathbb{R}^n$ that are conditionally independent given $h$, with means $\mu^{(j)} \in \mathbb{R}^n$, i.e.

$$E\big[x^{(j)} \mid h = r\big] = \mu_r^{(j)} \quad\text{and}\quad E\big[x^{(i)} \otimes x^{(j)} \mid h = r\big] = \mu_r^{(i)} \otimes \mu_r^{(j)} \ \text{ for } i \ne j.$$

Denote by $M^{(j)}$ the $n \times R$ matrix with the means $\{\mu_r^{(j)}\}_{r \in [R]}$ comprising its columns, i.e.

$$M^{(j)} = \big[\mu_1^{(j)} \,|\, \dots \,|\, \mu_r^{(j)} \,|\, \dots \,|\, \mu_R^{(j)}\big].$$

The entries (domain) of $x^{(j)}$ are bounded by $c_{\max}$, i.e. $\|x^{(j)}\|_\infty \le c_{\max}$.

The parameters of the model to be learned are the matrices $\{M^{(j)}\}_{j \in [\ell]}$ and the mixing weights $\{w_r\}_{r \in [R]}$. In many settings, the $n$-dimensional vectors $x^{(j)}$ are actually indicator vectors (hence $c_{\max} = 1$): this is commonly used to encode the case when the observation is one of $n$ discrete events. Allman et al. [AMR09] refer to these models as finite mixtures of finite measure products.

The following lemma shows how to obtain a higher order tensor (to apply our results from 
previous sections) in terms of the hidden parameters that we need to recover. It follows easily 
because of conditional independence. 



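Lemma 5.3 ([AMR09, AHK12]). In the notation established above for multi-view models, $\forall \ell \in \mathbb{N}$ the $\ell$th moment tensor

$$E\big[x^{(1)} \otimes \dots \otimes x^{(j)} \otimes \dots \otimes x^{(\ell)}\big] = \sum_{r \in [R]} w_r\, \mu_r^{(1)} \otimes \mu_r^{(2)} \otimes \dots \otimes \mu_r^{(j)} \otimes \dots \otimes \mu_r^{(\ell)}.$$

In our usual representation of tensor decompositions,

$$E\big[x^{(1)} \otimes \dots \otimes x^{(j)} \otimes \dots \otimes x^{(\ell)}\big] = \big[M^{(1)}\; M^{(2)}\; \dots\; M^{(\ell)}\big]$$

(with the weights $w_r$ absorbed into one of the factors).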

Recall that K-rank$_\tau(M)$ corresponds to the largest number $k$ such that every $n \times k$ submatrix $M'$ of $M$ has $\sigma_k(M') \ge 1/\tau$. Intuitively this says that no set of $k$ vectors from $\{\mu_r\}_{r \in [R]}$ is close to a $(k-1)$-dimensional space.
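For small matrices the robust Kruskal rank can be computed directly from this definition by enumerating column subsets. The sketch below is a brute-force illustration only (its cost is exponential in $R$); the function name `robust_k_rank` and the example matrix are our own.

```python
import numpy as np
from itertools import combinations

def robust_k_rank(M, tau):
    """Largest k such that every n x k column submatrix has sigma_k >= 1/tau."""
    n, R = M.shape
    k_rank = 0
    for k in range(1, min(n, R) + 1):
        ok = all(
            np.linalg.svd(M[:, list(cols)], compute_uv=False)[k - 1] >= 1.0 / tau
            for cols in combinations(range(R), k)
        )
        if not ok:
            break  # the property is monotone in k, so larger k cannot succeed
        k_rank = k
    return k_rank

# Hypothetical example: three vectors in the plane, pairwise independent.
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
```

For this matrix the smallest second singular value over pairs of columns is about 0.618, so `robust_k_rank(M, 2.0)` is 2 while `robust_k_rank(M, 1.5)` drops to 1: the robust notion depends on how well-conditioned the column subsets are, not just on linear independence.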

When $k = $ K-rank$_\tau(M) \ge R$ for each of these matrices (the non-degenerate or non-singular setting), Anandkumar et al. [AHK12] give a polynomial time algorithm to learn the hidden variables using only $\mathrm{poly}(R, \tau, n)$ samples (hence polynomial identifiability). However, their algorithm fails even when $k = R - 1$. We now show how to achieve polynomial identifiability even when $k = \delta R$ for any constant $\delta > 0$.

Footnote: In general, we can also allow them to be continuous distributions like multivariate gaussians.



Theorem 2.9 (Polynomial identifiability of Multi-view mixture model). The following statement holds for any constant integer $\ell$. Suppose we are given samples from a multi-view mixture model (see Def 5.2), with the parameters satisfying:

(a) For each mixture $r \in [R]$, the mixture weight $w_r > \gamma$.

(b) For each $j \in [\ell]$, K-rank$_\tau(M^{(j)}) \ge k \ge \frac{2R}{\ell} + 1$.

Then there is an algorithm that, given any $\eta > 0$, uses $N = \vartheta_{2.9}^{(\ell)}\big(\tfrac{1}{\eta}, R, n, \tau, 1/\gamma, c_{\max}\big)$ samples, and finds with high probability $\{\tilde M^{(j)}\}_{j \in [\ell]}$ and $\{\tilde w_r\}_{r \in [R]}$ (up to renaming of the mixtures $\{1, 2, \dots, R\}$) such that

$$\forall j \in [\ell]: \ \big\|M^{(j)} - \tilde M^{(j)}\big\|_F \le \eta \quad\text{and}\quad \forall r \in [R],\ |w_r - \tilde w_r| < \eta \qquad (18)$$

Further, this algorithm runs in time exp iR £ 2 log(^) + ^log(n • '^^^'"^ ) I poly{n) time. 

Note that the above theorem shows polynomial identifiability (for constant i), and additionally 
gives an algorithm which takes time n^^^^ )poly(T, R, Cmax) for inverse polynomial error. The func- 
tion i fegf '{■,...,■) = poly{Rn/{'yr]))'^ poly(n, r, 1/7)^ is a polynomial for constant £ and satisfies 
the theorem. 

Remarks:

1. Note that the condition (a) in the theorem about the mixing weights $w_r > \gamma$ is required to recover all the parameters, since we need $\mathrm{poly}(1/w_r)$ samples before we see a sample from mixture $r$. However, by setting $\gamma \ll \epsilon'$, the above algorithm can still be used to recover the mixture components of weight larger than $\epsilon'$.

2. While these results give new polynomial sample complexity guarantees when $n < R$, they are interesting even when the dimension of the space $n \gg R$. A natural setting where this arises is when many of the vectors lie in an unknown space of much smaller dimension ($k$-dims), while the whole space has high dimension.

3. The theorem also holds when, for different $j$, the K-rank$_\tau(M^{(j)})$ have bounds $k_j$ which are potentially different, and satisfy the same condition as in Theorem 2.7.



Proof. We will consider the $\ell$th moment tensor for $\ell = \lceil 2/\delta \rceil + 1$. The proof is simple, and proceeds in three steps. First, we use enough samples to obtain an estimate $\tilde T$ of the $\ell$th moment tensor $T$, up to inverse polynomial error. Then we find a good rank-$R$ approximation to $\tilde T$ (it exists because $T$ has rank $R$). We then use the Robust Uniqueness theorem for tensor decompositions to claim that the terms of this decomposition are in fact close to the hidden parameters.

Set r]' = jSj^. We know from Lemma 



C.l 



that the £ moment tensor can be estimated to 



accuracy 

ei = K-%if^H^/V) -^Zlt^/T. Cmax \/ra,Cmaa.^,n)j in \\-\\ p noTTci nsing N = O [e^"^ R{cmaxY \/T\ogn) 

samples. This estimated tensor T has a rank-i? decomposition upto error ei. 
Next, we will apply our algorithm for getting approximate low-rank tensor decompositions from Section 4 on $\tilde T$. Since each $\mu_r^{(j)}$ is a probability distribution, we can obtain vectors $\{u_r^{(j)}\}_{j \in [\ell], r \in [R]}$ (let us call the corresponding $n \times R$ matrices $U^{(j)}$) such that



\/je[i-l],rG[R] 



-,(j) 



G [1-6,1 + 6] where 6 = eWR < 



2f 



This is possible since the algorithm in Section 4 searches for the vectors $u_r^{(j)}$ by just enumerating over $\epsilon$-nets on an $R$-dimensional space. An alternate way to see this is to obtain any decomposition and scale all but the last column in the matrices $U^{(j)}$ so that they have $\ell_1$ norm of 1 (up to error $\delta$). Note that this step of finding an $\epsilon$-close rank-$R$ decomposition can also just consist of brute force enumeration, if we are only concerned with polynomial identifiability. Hence, we have obtained a rank-$R$ decomposition which is $O(\ell\epsilon_1)$ far in $\|\cdot\|_F$.

Now, we apply Theorem 2.7 to the $\ell$th moment tensor $T$ to claim that these $U^{(j)}$ are close to $M^{(j)}$ up to permutations. When we apply Theorem 2.7 we absorb the coefficients $w_r$ into $M^{(\ell)}$. In other words

$U^{(j)} = M^{(j)}$ for all $j \in [\ell-1]$, and $U^{(\ell)} = M^{(\ell)}\,\mathrm{diag}(w)$.

We know that K-rank$_\tau(M^{(j)}) = k_j$, and K-rank$_{\tau/\gamma}(U^{(\ell)}) = k_\ell$. We now apply Theorem 2.7 with our choice of $\epsilon_1$, and assuming that the permutation is the identity without loss of generality, we get



Vr G [R] 
and 



v^'^ 



ui'^ 



A.'~^\r)fi. 



U) 



<n' < 



rj-f 



16n£ 



VjG 



1] 



aW( 



rjWrfJ., 



W 



<v < 



16£n 



for some scalar matrices Aj (on i?-dims) such that 



< 



V 



lUn 



Note that the entries in the diagonal matrices $\Lambda_j$ (the scalings) may be negative. We first transform the vectors so that each of the entries in $\Lambda_j$ is non-negative (this is possible since the product of the $\Lambda_j$ is close to the identity matrix, which only has non-negative entries).



Vj E [£],r G [i?], 4^')=sgn 



r)]-u. 



U) 



This ensures that 



-,(i) 



-M) 



VjG[^-l],re[i?] 
Vr e [R] 

(i) 
Moreover, the fif correspond to probabi 



A(i)(, 
AW(r 



Hi: 



U) 



<7]' < 



VI 



WrH). 



(^) 



<v' < 



16n£ 
VI 



and 



lUn 



(19) 

(20) 
(21) 



that 
satisfy: 



vi^^ 



G [1 — 5, 1 -|- (5] . Applying Lemma 



ity v ectors which have ||^''-'^||i = 1, we have ensured 
A. 6 we get that the required estimates Vr (i-e. /ir ) 



VjG[^-l],rG[i?], 



v^^ 



fi. 



ij) 



< 



VI 



and 



A(j)(r 



VI 



,1 



8£Jn 
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Now, set jjr 
we get that 



41 



-, and Wr 



■M) 



for all r € [R\. Now, from equations (5.1) and (5.1) 



Vre[i?] A(^)(r)-1 



< 



7?7 



Hence from (21), 



W^^) - Wr^J!f^ 



< 



n 
4x/n 



Wrfl^p - Wr^l^r^ 



Wr 



Wr 



< 



VI 



< 



4y/n 
VI 



Using the fact that Wr > ^ and using Lemma A. 6 we see that Wr and /i- 

(i) 

to Wr and ^r respectively, for all r. 



,-,w 



are also jy-close estimates 
We will now see two popular latent variable models which fit into the multi-view mixture model: the exchangeable (single) Topic Model and Hidden Markov models. We note that the results of this section (for $\ell = 3$ views) also apply to other latent variable models like Latent Dirichlet Allocation (LDA) and Independent Component Analysis (ICA) that were studied in [AGH+12]. Anandkumar et al. [AFH+12, AGH+12] show how we can obtain third order tensors by looking at "third" moments and applying suitable transformations. Applying our robust uniqueness theorem (Theorem 2.6) to these 3-tensors identifies the parameters. We omit the details in this version of the paper.



5.2 Exchangeable (single) Topic Model 

The simplest latent variable model that fits the multi-view setting is the Exchangeable Single Topic model as given in [AHK12]. This is a simple bag-of-words model for documents, in which the words in a document are assumed to be exchangeable. This model can be viewed as first picking the topic $r \in [R]$ of the document, with probability $w_r$. Given a topic $r \in [R]$, each word in the document is sampled independently at random according to the probability distribution $\mu_r \in \mathbb{R}^n$ ($n$ is the dictionary size). In other words, the topic $r \in [R]$ is a latent variable such that the $\ell$ words in a document are conditionally i.i.d. given $r$.

The views in this case correspond to the words in a document. This is a special case of the multi-view model since the distribution of each of the views $j \in [\ell]$ is identical. As in [AHK12, AGH+12], we will represent the $\ell$ words in a document by indicator vectors $x^{(1)}, x^{(2)}, \dots, x^{(\ell)} \in \{0,1\}^n$ ($c_{\max} = 1$ here). Hence, the $(i_1, i_2, \dots, i_\ell)$ entry of the tensor $E[x^{(1)} \otimes x^{(2)} \otimes \dots \otimes x^{(\ell)}]$ corresponds to the probability that the first word is $i_1$, the second word is $i_2$, ... and the $\ell$th word is $i_\ell$. The following is a simple corollary of Theorem 2.9.



Figure 1: An HMM with $2q + 1$ time steps.

Figure: Embedding the HMM into the Multi-view model.

Corollary 5.4 (Polynomial Identifiability of Topic Model). The following statement holds for any constant $\delta > 0$. Suppose we are given documents generated by the topic model described above, where the topic probabilities of the $R$ topics are $\{w_r\}_{r \in [R]}$, and the probability distribution of words in a topic $r$ is given by $\mu_r \in \mathbb{R}^n$ (represented as an $n$-by-$R$ matrix $M$). If $\forall r \in [R]$, $w_r > \gamma$, and if K-rank$_\tau(M) \ge k \ge 2R/\ell + 1$, then there is an algorithm that, given any $\eta > 0$, uses $N = \vartheta_{2.9}^{(\ell)}\big(\tfrac{1}{\eta}, R, n, \tau, 1/\gamma, 1\big)$ samples, and finds with high probability $M'$ and $\{w'_r\}_{r \in [R]}$ such that

$$\|M - M'\|_F \le \eta \quad\text{and}\quad \forall r \in [R],\ |w_r - w'_r| < \eta \qquad (22)$$



Further, this algorithm runs in time n 



OdR'log{J-)fnr\^^'^ 



time. 



5.3 Hidden Markov Models 

The next latent variable model that we consider is the (discrete) Hidden Markov Model, which is extensively used in speech recognition, image classification, bioinformatics etc. We follow the same setting as in [AMR09]: there is a hidden state sequence $Z_1, Z_2, \dots, Z_m$, taking values in $[R]$, that forms a stationary Markov chain $Z_1 \to Z_2 \to \dots \to Z_m$ with transition matrix $P$ and initial distribution $w = \{w_r\}_{r \in [R]}$ (assumed to be the stationary distribution). The observation $X_t$ is from the set of discrete events $\{1, 2, \dots, n\}$ and it is represented by an indicator vector $x^{(t)} \in \mathbb{R}^n$. Given the state $Z_t$ at time $t$, $X_t$ (and hence $x^{(t)}$) is conditionally independent of all other observations and states. The matrix $M$ (of size $n \times R$) represents the probability distribution for the observations: the $r$th column $M_r$ represents the probability distribution conditioned on the state $Z_t = r$, i.e.

$$\forall r \in [R], \forall i \in [n], \quad \Pr[X_t = i \mid Z_t = r] = M_{ir}.$$

The HMM model described above is shown in Fig. 1.

Corollary 5.5 (Polynomial Identifiability of Hidden Markov models). The following statement holds for any constant $\delta > 0$. Suppose we are given a Hidden Markov model as described above, with parameters satisfying:

(a) The stationary distribution $\{w_r\}_{r \in [R]}$ has $\forall r \in [R]$, $w_r > \gamma_1$,

(b) The observation matrix $M$ has K-rank$_\tau(M) \ge k \ge \delta R$,

(c) The transition matrix $P$ has minimum singular value $\sigma_R(P) \ge \gamma_2$,

then there is a algorithm that given any r] > uses N = i!^2M^^ ( ^1 -Rj ^) ''"1 :z~:z~ ) samples 

of $m = 2\lceil 1/\delta \rceil + 3$ consecutive observations (of the Markov chain), and finds with high probability $P'$, $M'$ and $\{\tilde w_r\}_{r \in [R]}$ such that

$$\|M - M'\|_F \le \eta, \quad \|P - P'\|_F \le \eta \quad\text{and}\quad \forall r \in [R],\ |w_r - \tilde w_r| < \eta \qquad (23)$$

Footnote: In general, we can also allow $x_t$ to be certain continuous distributions like multivariate gaussians.



Further, this algorithm runs in time n ^ '^vji'' ( n . 



7172 



Os{l) 



time. 



Proof sketch. The proof follows along the lines of Allman et al. [AMR09], so we only sketch it
here. We first show how to cast this HMM into a multi-view model (Def. 5.2) using a nice trick
of [AMR09]. We can then apply Theorem 2.9 and prove identifiability (Corollary 5.5). We will
choose $m = 2q + 1$ where $q = \lceil 1/\delta \rceil + 1$, and then use the hidden state $Z_{q+1}$ as the latent variable
$h$ of the multi-view model. We will use three different views ($\ell = 3$), as shown in the corresponding figure: the first
view $A$ comprises the tuple of observations $(X_q, X_{q-1}, \dots, X_1)$ (ordered this way for convenience),
the second view $B$ is the observation $X_{q+1}$, while the third view $C$ comprises the tuple
$(X_{q+2}, X_{q+3}, \dots, X_{2q+1})$. This fits into the multi-view model since the three views are conditionally
independent given the latent variable $h = Z_{q+1}$.

Abusing notation a little, let $A, B, C$ be matrices of dimensions $n^q \times R$, $n \times R$, $n^q \times R$ respectively.
They denote the conditional probability distributions as in Definition 5.2. For convenience, let
$\tilde{P} = \mathrm{diag}(w) P^T \mathrm{diag}(w)^{-1}$, which is the "reverse transition" matrix of the Markov chain given by
$P$. We can now write the matrices $A, B, C$ in terms of $M$ and the transition matrices. The matrix
product $X \odot Y$ refers to the Khatri-Rao product (Lemma A.4). Showing that these are indeed the
transition matrices is fairly straightforward, and we refer to Allman et al. [AMR09] for the details.

$$A = ((\dots((M\tilde{P}) \odot M)\tilde{P}) \odot M) \dots \tilde{P}) \odot M)\tilde{P} \qquad (24)$$
$$B = M \qquad (25)$$
$$C = ((\dots((MP) \odot M)P) \odot M) \dots P) \odot M)P \qquad (26)$$

(There are precisely $q$ occurrences of $M$ and of $P$ (or $\tilde{P}$) in the first and third equalities.) Now
we can use the properties of the Khatri-Rao product. For convenience, define $C^{(1)} = MP$, and
$C^{(j)} = (C^{(j-1)} \odot M)P$ for $j \ge 2$, so that we have $C = C^{(q)}$. By hypothesis, we have $\text{K-rank}_\tau(M) \ge k$,
and thus $MP$ also has robust K-rank at least $k$, with $\tau$ replaced by a parameter larger by a factor
depending on $\gamma_2$ (because $P$ is a stochastic matrix with all singular values at least $\gamma_2$). Now by
the property of the Khatri-Rao product (Lemma A.4), $C^{(2)}$ has robust K-rank at least $\min\{R, 2k\}$,
again with a correspondingly larger parameter. We can continue this argument, to eventually conclude
that $\text{K-rank}_{\tau'}(C^{(q)}) = \min\{R, qk\} = R$ (note that $qk \ge R$, since $q > 1/\delta$ and $k \ge \delta R$), where
$\tau'$ is polynomial in $\tau$, $1/\gamma_2$ and $qk$ for constant $q$.

Precisely the same argument lets us conclude that $\text{K-rank}_{\tau'}(A) \ge R$, for a parameter $\tau'$ of the same form.



Now since $\text{K-rank}_\tau(B) \ge 2$, we have that the conditions of Theorem 2.6 hold. Now using the
arguments of Theorem 2.9 (here, we use Theorem 2.6 instead of Theorem 2.7), we get matrices
$A', B', C'$ and weights $w'$ such that

$$\|A' - A\|_F < \delta \ \text{(and similarly for } B, C\text{)}, \qquad \|w' - w\| < \delta$$

for some $\delta = \mathrm{poly}(1/\eta, \dots)$. Note that $M = B$. We now need to argue that we can obtain a good
estimate $P'$ for $P$ from $A', B', C'$. This is done in [AMR09] by a trick which is similar in spirit to
Lemma A.5. It uses the property that the matrix $C$ above is full rank (in fact well conditioned, as
we saw above), and the fact that the columns of $M$ are all probability distributions.

Let $D = C^{(q-1)}$, as defined above. Hence, $C = (D \odot M)P$. Now note that all the columns of
$M$ represent probability distributions, so they add up to 1. Thus given $D \odot M$, we can combine
(simply add) appropriate rows together to get $D$. Thus by performing this procedure (adding rows)
on $C$, we obtain $DP$. Now, if we had performed the entire procedure by replacing $q$ with $(q - 1)$
(we should ensure that $(q - 1)k \ge R$ for the Kruskal rank condition to hold), we would obtain the
matrix $D$. Now knowing $D$ and $DP$, we can recover the matrix $P$, since $D$ is well-conditioned. □
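
The row-adding step can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the procedure of [AMR09]: the helper names `khatri_rao` and `marginalize` and the toy matrices are hypothetical, matrices are plain lists of rows, and rows of the Khatri-Rao product are assumed to be ordered with the index of $M$ varying fastest.

```python
def khatri_rao(D, M):
    """Column-wise Kronecker product: column r is D[:, r] (x) M[:, r].

    Row (i, j) of the result is stored at position i * len(M) + j.
    """
    rows_d, rows_m, cols = len(D), len(M), len(D[0])
    return [[D[i][r] * M[j][r] for r in range(cols)]
            for i in range(rows_d) for j in range(rows_m)]


def marginalize(K, rows_m):
    """Add each block of rows_m consecutive rows.

    When every column of M sums to 1, this maps the Khatri-Rao
    product of D and M back to D.
    """
    cols = len(K[0])
    return [[sum(K[b + j][r] for j in range(rows_m)) for r in range(cols)]
            for b in range(0, len(K), rows_m)]


# Toy data (assumed for illustration only).
D = [[0.5, 2.0], [1.5, -1.0]]               # arbitrary 2 x 2 matrix
M = [[0.2, 0.7], [0.3, 0.1], [0.5, 0.2]]    # columns are probability vectors
recovered = marginalize(khatri_rao(D, M), len(M))
```

Because adding rows is a linear operation, applying `marginalize` to $(D \odot M)P$ would give $DP$ in the same way.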



Remark: Allman et al. [AMR09] show identifiability under weaker conditions than Corollary 5.5
when they have infinite samples. This is because they prove their results for generic values of the
parameters $M, P$ (this formally means their results hold for all $M, P$ except a set of measure zero,
but they do not give an explicit characterization). Our bounds are weaker, but hold whenever the
$\text{K-rank}_\tau(M) \ge \delta n$ condition holds. Further, the main advantage is that our result is robust to
noise: the case when we only have finite samples.

5.4 Mixtures of Spherical Gaussians 

Suppose we have a mixture of $R$ spherical gaussians in $\mathbb{R}^n$, with mixing weights $w_1, w_2, \dots, w_R$,
means $\mu_1, \mu_2, \dots, \mu_R$, and the common variance $\sigma^2$. Let us denote this mixture distribution by $\mathcal{D}$,
and the $n \times R$ matrix of means by $M$.

We define the $\mu$-tensor of $\ell$th order to be

$$\mathrm{Mom}_\ell := \sum_{i \in [R]} w_i \mu_i^{\otimes \ell}.$$

The mean $\mu := \mathrm{Mom}_1$ can be estimated by drawing samples $x \sim \mathcal{D}$ and
computing $\mathbb{E}[x]$. Similarly, we will show how to compute $\mathrm{Mom}_\ell$ for larger $\ell$ by computing higher
order moment tensors, assuming we know the value of $\sigma$. We can then use the robust Kruskal's
theorem (Theorem 2.7) and the sampling lemma (Lemma C.2) to conclude the following theorem.

Theorem 5.6. Suppose we have a mixture of gaussians given by $\mathcal{D}$, with hidden parameters
$\{w_r\}_{r \in [R]}$, $\{\mu_r\}_{r \in [R]}$ and $\sigma$ (in particular, we assume we know $\sigma$). Suppose also that $\forall r \in [R]$, $w_r > \gamma$,
and $\text{K-rank}_\tau(M) = k$ for some $k \ge \delta R$.



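Then there is an algorithm that given any $\eta > 0$ and $\sigma$, uses $N = \mathrm{poly}(1/\eta, R, n, \tau, 1/\gamma)$
samples drawn from $\mathcal{D}$, and finds with high probability $M'$ and $\{w'_r\}_{r \in [R]}$ such that

$$\|M - M'\|_F < \eta \quad \text{and} \quad \forall r \in [R],\ |w_r - w'_r| < \eta. \qquad (27)$$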



Further, this algorithm runs in time n "^ ^ M tt I 



time. 



Proof. This will follow the same outline as Theorem 2.9, so we only sketch the proof here. The theorem
holds with an error polynomial that is essentially the same as the error polynomial there. However, we
first need to gain access to an order-$\ell$ tensor, where each rank-1 term corresponds to a mean $\mu_r$.
Hence, we show how to obtain this order-$\ell$ tensor of means, by subtracting out terms involving $\sigma$,
using our estimates of moments up to $\ell$.

Footnote (Theorem 5.6): As will be clear, it suffices to know $\sigma$ up to an inverse polynomial error, so from an
algorithmic viewpoint, we can "try all possible" values.

Pick $\ell = \lceil 1/\delta \rceil + 2$. We will use order-$\ell$ tensors given by the $\ell$th moment. We will first show
how to obtain $\mathrm{Mom}_\ell$, from which we learn the parameters. The computation of $\mathrm{Mom}_\ell$ will be done
inductively. Note that $\mathrm{Mom}_1$ is simply $\mathbb{E}[x]$. Writing a sample from the $i$th component as
$\mu_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I)$, observe that

$$\mathbb{E}[x \otimes x] = \mathbb{E}\Big[\sum_i w_i (\mu_i + \varepsilon_i) \otimes (\mu_i + \varepsilon_i)\Big]
= \sum_i w_i\, \mu_i \otimes \mu_i + \mathbb{E}\Big[\sum_i w_i\, \varepsilon_i \otimes \varepsilon_i\Big]
= \mathrm{Mom}_2 + \sigma^2 I,$$

where the cross terms vanish because each $\varepsilon_i$ has mean zero.
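
As a numerical sanity check of the identity $\mathbb{E}[x \otimes x] = \mathrm{Mom}_2 + \sigma^2 I$, the following is a minimal sketch, assuming a toy two-component mixture in two dimensions. The function names, the mixture parameters and the sample size are all hypothetical choices made for illustration, and the estimator simply averages outer products and subtracts $\sigma^2$ from the diagonal.

```python
import random


def sample_mixture(weights, means, sigma, num_samples, rng):
    """Draw samples from a mixture of spherical gaussians with common variance sigma^2."""
    samples = []
    for _ in range(num_samples):
        mu = rng.choices(means, weights=weights)[0]   # pick a component
        samples.append([m + rng.gauss(0.0, sigma) for m in mu])
    return samples


def estimate_mom2(samples, sigma):
    """Estimate Mom_2 as the empirical second moment minus sigma^2 * I."""
    n, count = len(samples[0]), len(samples)
    second = [[sum(x[i] * x[j] for x in samples) / count for j in range(n)]
              for i in range(n)]
    for i in range(n):
        second[i][i] -= sigma ** 2
    return second


# Toy parameters (assumed for illustration only).
weights = [0.3, 0.7]
means = [[1.0, 0.0], [0.0, 2.0]]
sigma = 0.5
rng = random.Random(0)
mom2_estimate = estimate_mom2(sample_mixture(weights, means, sigma, 20000, rng), sigma)

# Exact Mom_2 = sum_i w_i * mu_i mu_i^T for the toy parameters.
mom2_exact = [[sum(w * mu[i] * mu[j] for w, mu in zip(weights, means))
               for j in range(2)] for i in range(2)]
```

With more samples the estimate concentrates around `mom2_exact`, in line with the sampling lemma (Lemma C.2).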



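We compute $\mathbb{E}[x^{\otimes 2}]$ by sampling, and since we know $\sigma$, we can find $\mathrm{Mom}_2$ up to any polynomially
small error. In general, we have

$$\mathbb{E}[x^{\otimes \ell}] = \sum_i w_i\, \mathbb{E}\big[(\mu_i + \varepsilon_i)^{\otimes \ell}\big] \qquad (28)$$
$$\phantom{\mathbb{E}[x^{\otimes \ell}]} = \sum_i w_i \sum_{x_j \in \{\mu_i, \varepsilon_i\}} \mathbb{E}[x_1 \otimes x_2 \otimes \dots \otimes x_\ell]. \qquad (29)$$

The last summation has $2^\ell$ terms. One of them is $\mu_i^{\otimes \ell}$, which produces $\mathrm{Mom}_\ell$ on the RHS. The
other terms have the form $x_1 \otimes x_2 \otimes \dots \otimes x_\ell$, where some of the $x_j$ are $\mu_i$ and the rest $\varepsilon_i$, and there
is at least one $\varepsilon_i$.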

If a term has $r$ factors equal to $\mu_i$ and $\ell - r$ equal to $\varepsilon_i$, the tensor obtained is essentially a permutation
of $\tilde{\mu}(r, \ell) := \mu_i^{\otimes r} \otimes \varepsilon_i^{\otimes (\ell - r)}$. By permutation, we mean that the $(j_1, \dots, j_\ell)$th entry of the tensor
would correspond to the $(j_{\pi(1)}, \dots, j_{\pi(\ell)})$th entry of $\tilde{\mu}(r, \ell)$, for some permutation $\pi$. Thus we focus
on showing how to evaluate the tensor $\tilde{\mu}(r, \ell)$ for different $r, \ell$.

Note that if $\ell - r$ is odd, we have that $\mathbb{E}[\tilde{\mu}(r, \ell)] = 0$. This is because the odd moments of a
Gaussian with mean zero are all zero (since it is symmetric). If $\ell - r$ is even, we can
describe the tensor $\mathbb{E}[\varepsilon_i^{\otimes(\ell - r)}]$ explicitly as follows. Consider an index $(j_1, \dots, j_{\ell - r})$, and bucket the $j$
into groups of equal coordinates. For example for index $(1, 2, 3, 2)$, the buckets are $\{(1), (2, 2), (3)\}$.
Now suppose the bucket sizes are $b_1, \dots, b_t$ (they add up to $\ell - r$). Then the $(j_1, \dots, j_{\ell - r})$th entry
of $\mathbb{E}[\varepsilon_i^{\otimes(\ell - r)}]$ is precisely the product $m_{b_1} m_{b_2} \cdots m_{b_t}$, where $m_s$ is the $s$th moment of the univariate
Gaussian $\mathcal{N}(0, \sigma^2)$.
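
The bucketing rule translates directly into code. The following is a minimal sketch (the function names are hypothetical); it uses the standard fact that the $s$th moment of $\mathcal{N}(0, \sigma^2)$ is $0$ for odd $s$ and $\sigma^s (s-1)!!$ for even $s$.

```python
from collections import Counter


def gaussian_moment(s, sigma):
    """s-th moment of N(0, sigma^2): 0 for odd s, sigma^s * (s - 1)!! for even s."""
    if s % 2 == 1:
        return 0.0
    double_factorial = 1
    for odd in range(1, s, 2):
        double_factorial *= odd
    return (sigma ** s) * double_factorial


def noise_tensor_entry(index, sigma):
    """Entry E[eps_{j_1} * ... * eps_{j_m}] of the noise tensor for eps ~ N(0, sigma^2 I).

    Coordinates of eps are independent, so the expectation factors over
    buckets of equal coordinates.
    """
    entry = 1.0
    for bucket_size in Counter(index).values():
        entry *= gaussian_moment(bucket_size, sigma)
    return entry
```

For the index $(1, 2, 3, 2)$ from the text, the bucket sizes are $1, 2, 1$, so the entry is zero because the first moment vanishes; for $(1, 1, 2, 2)$ the entry is $m_2^2 = \sigma^4$.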



The above describes the entries of $\mathbb{E}[\varepsilon_i^{\otimes(\ell - r)}]$. Now $\mathbb{E}[\tilde{\mu}(r, \ell)]$ is precisely $\mathrm{Mom}_r \otimes \mathbb{E}[\varepsilon^{\otimes(\ell - r)}]$ (since
the $\mu_i$ is fixed). Thus, since we have inductively computed $\mathrm{Mom}_r$ for $r < \ell$, this gives a procedure
to compute each entry of $\mathbb{E}[\tilde{\mu}(r, \ell)]$. Thus each of the $2^\ell$ terms in the RHS of (28) except $\mathrm{Mom}_\ell$
can be calculated using this process. The LHS can be estimated to any inverse polynomial small
error by sampling (Lemma C.2). Thus we can estimate $\mathrm{Mom}_\ell$ up to a similar error.

Hence, we can use the algorithm from Section 4 and apply Corollary 3.9 to obtain vectors
$\{u_r\}_{r \in [R]}$ such that

$$\forall r \in [R], \quad \|u_r - w_r^{1/\ell} \mu_r\| < \eta.$$

Similarly, applying the same process with $\mathrm{Mom}_{\ell - 1}$ (the Kruskal conditions also hold for $\ell - 1$) we get
$\eta$-close approximations to $w_r^{1/(\ell - 1)} \mu_r$. Now, we appeal to Lemma 5.7
to obtain $\{w'_r, \mu'_r\}_{r \in [R]}$. □



Remark: Note that the previous proof works even when the gaussians are not spherical: they
just need to have the same known covariance matrix $\Sigma$.

The following lemma (used in the proof of Theorem 5.6) allows us to recover the weights after
obtaining estimates of $w_r^{1/(\ell - 1)} \mu_r$ and $w_r^{1/\ell} \mu_r$ through decompositions for the $(\ell - 1)$th and the $\ell$th moment
tensors.

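Lemma 5.7 (Recovering Weights). For every $\delta' > 0$, $w > 0$, $L_{\min} > 0$, $\ell \in \mathbb{N}$, there exists $\delta > 0$
(polynomially small in the other parameters) such that, if $\mu \in \mathbb{R}^n$ is a vector with length $\|\mu\| \ge L_{\min}$, and

$$\|u - w^{1/\ell} \mu\| < \delta \quad \text{and} \quad \|v - w^{1/(\ell - 1)} \mu\| < \delta,$$

then

$$\left| \left( \frac{|\langle u, v \rangle|}{\|u\|^2} \right)^{\ell(\ell - 1)} - w \right| < \delta'. \qquad (30)$$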



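Proof. Let $\alpha_1 = w^{-1/(\ell - 1)}$ and $\alpha_2 = w^{-1/\ell}$. From the hypotheses of the lemma and the triangle inequality, we see that

$$\|\alpha_1 v - \alpha_2 u\| \le \|\alpha_1 v - \mu\| + \|\mu - \alpha_2 u\| < (\alpha_1 + \alpha_2)\delta =: \delta_1.$$

Suppose $v = \beta u + \varepsilon u_\perp$ where $u_\perp$ is a unit vector perpendicular
to $u$. Hence $\beta = \langle v, u \rangle / \|u\|^2$. Then

$$\|\alpha_1 v - \alpha_2 u\|^2 = \|(\beta \alpha_1 - \alpha_2) u + \alpha_1 \varepsilon u_\perp\|^2 < \delta_1^2,$$
$$(\beta \alpha_1 - \alpha_2)^2 \|u\|^2 + \alpha_1^2 \varepsilon^2 < \delta_1^2,$$
$$\left| \beta - \frac{\alpha_2}{\alpha_1} \right| < \frac{\delta_1}{\alpha_1 \|u\|}.$$

Now, substituting the values for $\alpha_1, \alpha_2$ (so that $\alpha_2 / \alpha_1 = w^{1/(\ell(\ell - 1))}$) and using
$\|u\| \ge w^{1/\ell} L_{\min} - \delta$, we see that $\beta$ is within a polynomial multiple of $\delta$ of $w^{1/(\ell(\ell - 1))}$. Hence

$$\left| \beta^{\ell(\ell - 1)} - w \right| < \delta'$$

when $\delta$ is sufficiently small compared to $\delta'$, $w$ and $L_{\min}$. □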
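
A small sketch of how the weight could be read off from the two scaled means may help here. It assumes the convention that $u \approx w^{1/\ell} \mu$ comes from the $\ell$th moment tensor and $v \approx w^{1/(\ell - 1)} \mu$ from the $(\ell - 1)$th one; the function names and the toy numbers are hypothetical.

```python
def dot(a, b):
    """Standard inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recover_weight(u, v, ell):
    """Estimate w, assuming u ~ w^(1/ell) * mu and v ~ w^(1/(ell-1)) * mu.

    The ratio <u, v> / ||u||^2 equals w^(1/(ell*(ell-1))) in the noiseless case.
    """
    beta = abs(dot(u, v)) / dot(u, u)
    return beta ** (ell * (ell - 1))


def recover_mean(u, weight, ell):
    """Undo the w^(1/ell) scaling on u to get an estimate of mu."""
    scale = weight ** (1.0 / ell)
    return [x / scale for x in u]


# Noiseless toy instance (assumed for illustration only).
true_w, true_mu, ell = 0.25, [3.0, -4.0], 3
u = [true_w ** (1.0 / ell) * x for x in true_mu]
v = [true_w ** (1.0 / (ell - 1)) * x for x in true_mu]
w_estimate = recover_weight(u, v, ell)
mu_estimate = recover_mean(u, w_estimate, ell)
```

With perturbed inputs the estimate degrades gracefully, which is what the lemma quantifies.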



The following corollary establishes polynomial identifiability for mixtures of uniform spherical
gaussians under milder conditions than [HK12] (in particular, the means need not be in general
position). The difference now is that we do not assume we know $\sigma$.

Corollary 5.8. Suppose we have a mixture $\mathcal{D}$ of $R$ gaussians in $n$ dimensions with $n > R$, with
hidden parameters $\{w_r\}_{r \in [R]}$, $\{\mu_r\}_{r \in [R]}$ and $\sigma$. Suppose $\forall r \in [R]$, $w_r > \gamma$, and that $\text{K-rank}_\tau(M) = k$ for
some $k \ge \delta R$.

Then there is an algorithm that given any $\eta > 0$, uses $N = \mathrm{poly}(1/\eta, R, n, \tau, 1/\gamma)$ samples drawn
from $\mathcal{D}$, and finds with high probability $\sigma'$, $M'$ and $\{w'_r\}_{r \in [R]}$ such that

$$\|M - M'\|_F < \eta \quad \text{and} \quad \forall r \in [R],\ |w_r - w'_r| < \eta \quad \text{and} \quad |\sigma - \sigma'| < \eta. \qquad (31)$$

Further, this algorithm runs in time n ■'^ ( ^ ) time. 

Proof sketch. We first obtain $\sigma$ to inverse polynomial accuracy, using an elegant trick of [HK13],
and then apply Theorem 5.6 to identify the parameters $M$ and weights $\{w_r\}_{r \in [R]}$.

To estimate $\sigma$, we consider the matrix $A = \mathbb{E}[(x - \mathrm{Mom}_1) \otimes (x - \mathrm{Mom}_1)]$, and note that the
estimated $n$th singular value $\sigma_n(A) \in [\sigma^2 - \eta, \sigma^2 + \eta]$ after averaging enough samples (see Theorem 1
in [HK13] for details). This is because the $R$ vectors $\mu_i - \mathrm{Mom}_1$ live in an $(R - 1) \le n - 1$ dimensional
space. Hence, we can obtain $\sigma$ to any inverse polynomial accuracy (see [HK13] for details). This allows
us to recover the parameters using Theorem 5.6. We omit the details in this version. □



6 Discussion and Open Problems 

The most natural open problem arising from our work is that of computing approximate small
rank decompositions efficiently. While the problem is NP-hard in general, we suspect that
well-conditionedness assumptions, such as the robust Kruskal ranks being sufficiently large as in the uniqueness
theorem (Theorem 2.6) for decompositions of 3-tensors, could help. In particular,

Question 6.1. Suppose $T$ is a 3-tensor that is promised to have a rank $R$ decomposition $[A\ B\ C]$,
with $k_A = \text{K-rank}_\tau(A)$ (similarly $k_B$ and $k_C$) satisfying $k_A + k_B + k_C \ge 2R + 2$. Can we find the
decomposition $A, B, C$ (up to a specified error $\varepsilon$) in time polynomial in $n$, $R$ and $1/\varepsilon$?

In the special case that the decomposition $[A\ B\ C]$ is known to be orthogonal (i.e., the columns
of $A, B, C$ are mutually orthogonal), which in particular implies $n \ge R$, iterative methods
like power iteration [AGH+12] and "alternating least squares" (ALS) [CLdA09] converge in
polynomial time.

A result in the spirit of finding weaker sufficient conditions for uniqueness was given by Chiantini
and Ottaviani [CO12], who use ideas from algebraic geometry (in particular a notion called weak
defectivity) to prove that generic $n \times n \times n$ tensors of rank $k \le n^2/16$ have a unique decomposition
(here the word 'generic' means all except a measure zero set of rank $k$ tensors, which
they characterize in terms of weak defectivity). Note that this is much stronger than the bound
obtained by Kruskal's theorem, which is roughly $3n/2$. It is also roughly the best one can hope
for, since every 3-tensor has rank at most $n^2$ (and a random tensor has rank $\ge n^2/2$). It would
be very interesting to prove robust versions of their results, as it would imply identifiability for a
much larger range of parameters in the models we consider.

Footnote (ALS): This is the method of choice in practice for computing tensor decompositions.

A third question is that of certifying that a given decomposition is unique. Kruskal's rank
condition, while elegant, is not known to be verifiable in polynomial time. Given an $n \times R$ matrix,
certifying that every $k$ columns are linearly independent is known to be NP-hard [Kha95, TP12].
Even the average case version, i.e. when the matrix is random with independent gaussian entries, has
received much attention as it is related to certifying the Restricted Isometry Property (RIP), which
plays a key role in compressed sensing [CT05, KZ11]. It is thus a fascinating open question to
find uniqueness (and robust uniqueness) theorems which involve parameters that can be computed
efficiently.

From the perspective of learning latent variable models, it would be very interesting to obtain
efficient learning algorithms with polynomial running times for the settings considered in Section 5.
Recall that we give algorithms which need only polynomial samples (in the dimension $n$, and
number of mixtures $R$), when the parameters satisfy the robust Kruskal conditions. Note that
an affirmative answer to Question 6.1 (and its higher order analogue) would already imply such
efficient learning algorithms. Finally, we believe that our approach can be extended to learning
the parameters of general mixtures of gaussians [MV10, BS10], mixtures of product distributions
[FOS05], and more generally to a broader class of parameter learning problems.

A A Medley of Auxiliary Lemmas

We now list some of the (primarily linear algebra) lemmas we used in our proofs. They range in 
difficulty from trivial to 'straightforward', but we include them for completeness. 

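Lemma A.1. Suppose $X$ is a matrix in $\mathbb{R}^{n \times k}$ with $\sigma_k \ge 1/\tau$. Then if $\|\sum_i \alpha_i X_i\|_2 < \varepsilon$ for some
$\alpha_i$, we have $\|\alpha\|_2 \le \tau \varepsilon$.

Proof. From the singular value condition, we have for any $y \in \mathbb{R}^k$,

$$\|Xy\|_2^2 \ge \sigma_k^2 \|y\|_2^2,$$

from which the lemma follows by setting $y$ to be the vector of $\alpha_i$. □

Lemma A.2. Let $A \in \mathbb{R}^{n \times R}$ have $\text{K-rank}_\tau = k$ and be $\rho$-bounded. Then,

1. If $\mathcal{S} = \mathrm{span}(S)$, where $S$ is a set of at most $k - 1$ column vectors of $A$, then each unit vector
in $\mathcal{S}$ has a small representation in terms of the columns denoted by $S$: for $v = \sum_{i \in S} z_i A_i$,

$$\frac{1}{(\rho^2 + 1)k} \le \frac{\sum_i z_i^2}{\|v\|^2} \le \max\{\tau^2, 1\}.$$

2. If $\mathcal{S} = \mathrm{span}(S)$ where $S$ is any subset of $k - 1$ column vectors of $A$, the other columns are
far from the span $\mathcal{S}$:

$$\forall j \in [R] \setminus S, \quad \|\Pi_{\mathcal{S}}^\perp A_j\| \ge \frac{1}{\tau}.$$

3. If $\mathcal{S}$ is any $\ell$-dimensional space with $\ell < k$, then at most $\ell$ column vectors of $A$ are $\varepsilon$-close to
it for $\varepsilon = 1/(\tau \sqrt{k})$:

$$\left| \left\{ j : \|\Pi_{\mathcal{S}}^\perp A_j\| < \varepsilon \right\} \right| \le \ell.$$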

Proof. We now present the simple proofs of the three parts of the lemma.

1. The first part simply follows from a change of basis. Let $M$ be the $n \times n$ matrix,
where the first $|S|$ columns of $M$ correspond to $S$ and the remaining $n - |S|$ columns are
unit vectors orthogonal to $\mathcal{S}$. Since $A_S$ is well-conditioned, $\lambda_{\max}(M) \le (\rho + 1)\sqrt{k}$ and
$\lambda_{\min}(M) \ge 1/\max\{\tau, 1\}$. The change of basis matrix is exactly $M^{-1}$: hence $z = M^{-1} v$.
Thus, $\lambda_{\min}(M^{-1}) \le \|z\| \le \lambda_{\max}(M^{-1}) = 1/\lambda_{\min}(M) \le \max\{1, \tau\}$.

2. Let $S = \{1, \dots, k - 1\}$ and $j = k$ without loss of generality. Let $v = \sum_{i \in S} z_i A_i$ be a vector
$\varepsilon$-close to $A_k$. Let $M'$ be the $n \times k$ matrix restricted to the first $k$ columns, i.e. $M' = A_{S \cup \{j\}}$.
Hence, the vector $z = (z_1, \dots, z_{k-1}, -1)$ has squared length $1 + \sum_i z_i^2$ and $\|M' z\| = \varepsilon$. Thus,

$$\varepsilon \ge \lambda_{\min}(M') \sqrt{1 + \sum_i z_i^2} \ge 1/\tau.$$

3. Let $\varepsilon = 1/(\tau \sqrt{k})$. For contradiction, assume that $S = \{i : \|\Pi_{\mathcal{S}}^\perp A_i\| < \varepsilon\}$ is of size $\ell + 1$. Let
$v_i = \Pi_{\mathcal{S}} A_i \in \mathcal{S}$. Since $\{v_i\}_{i \in S}$ are $\ell + 1$ vectors in an $\ell$-dimensional space,

$$\exists \{\alpha_i\}_{i \in S} \text{ with } \sum_i \alpha_i^2 = 1, \text{ s.t. } \sum_i \alpha_i v_i = 0.$$

Hence, $\|\sum_{i \in S} \alpha_i A_i\| = \|\sum_{i \in S} \alpha_i \Pi_{\mathcal{S}}^\perp A_i\| < (\sum_{i \in S} |\alpha_i|)\, \varepsilon \le \sqrt{k}\, \varepsilon$ (where the last inequality
follows from the Cauchy-Schwarz inequality). But this set of $\alpha_i$ contradicts the fact that the
minimum singular value of any $n$-by-$k$ submatrix of $A$ is at least $1/\tau$.

□

Lemma A. 3. Let ui, . . . , uj G M^ (for some t, d) satisfy \\ui\\2 > e > for all i. Then there exists 
a unit vector w ^W^ s.t. \{ui,w)\ > 2^ for all i G [t]. 

Proof. The proof is by a somewhat standard probabilistic argument.

Let $r \in \mathbb{R}^d$ be a random vector drawn from a uniform spherical Gaussian with unit variance
in each direction. It is well-known that for any $y \in \mathbb{R}^d$, the inner product $\langle y, r \rangle$ is distributed
as a univariate Gaussian with mean zero and variance $\|y\|_2^2$. Thus for each $y$, from standard
anti-concentration properties of the Gaussian, we have

Prn/u„r)|<Ml < 1. 
Li\ t, /I _ ^Q^i - 2t 

Thus by a union bound, with probability at least 1/2, we have 

Pr[K^^*,^>l>^] foralH. (32) 

Next, since $\mathbb{E}\|r\|_2^2 = d$, $\Pr[\|r\|_2^2 > 4d] < 1/4$, and thus there exists a vector $r$ s.t. $\|r\|_2^2 \le 4d$ and
Eq. (32) holds. This implies the lemma (in fact we obtain $\sqrt{d}$ in the denominator). □



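Lemma A.4 (K-rank of the Khatri-Rao product). For two matrices $A, B$ with $R$ columns and robust K-ranks
$k_A = \text{K-rank}_{\tau_1}(A)$ and $k_B = \text{K-rank}_{\tau_2}(B)$, the matrix $M = A \odot B$
has $\text{K-rank}_{(\tau_1 \tau_2 \sqrt{k_A + k_B})}(M) \ge \min\{k_A + k_B - 1, R\}$.

Proof. Let $\tau = \tau_1 \tau_2 \sqrt{k_A + k_B}$. Suppose for contradiction $M$ has $\text{K-rank}_\tau(M) < k = k_A + k_B - 1 \le R$
(otherwise we are done).

Without loss of generality let the sub-matrix $M'$ of size $(n_1 n_2) \times k$, formed by the first $k$ columns
of $M$, have $\lambda_k(M') < 1/\tau$. Note that for a vector $z \in \mathbb{R}^{n_1 n_2}$, $\|z\|_2 = \|Z\|_F$ where $Z$ is the natural
$n_1 \times n_2$ matrix representing $z$. Hence, with $\varepsilon = 1/\tau$,

$$\exists \{\alpha_i\}_{i \in [k]} \text{ with } \sum_{i \in [k]} \alpha_i^2 = 1 \text{ s.t. } \Big\| \sum_{i \in [k]} \alpha_i A_i \otimes B_i \Big\|_F < \varepsilon.$$

Clearly $\exists i^* \in [k]$ s.t. $|\alpha_{i^*}| \ge 1/\sqrt{k}$: let $i^* = k$ without loss of generality. Let $\mathcal{S} = \mathrm{span}(\{A_1, A_2, \dots, A_{k_A - 1}\})$,
and pick $x = \Pi_{\mathcal{S}}^\perp A_k / \|\Pi_{\mathcal{S}}^\perp A_k\|$ (it exists because $\text{K-rank}_\tau(M) < R$).
Pre-multiplying the expression above by $x$, we get

$$\Big\| \sum_{i = k_A}^{k} \beta_i B_i \Big\| < \varepsilon \quad \text{where } \beta_i = \alpha_i \langle x, A_i \rangle.$$

But $|\beta_k| \ge 1/(\sqrt{k}\, \tau_1)$ (by Lemma A.2), and there are only $k - k_A + 1 \le k_B$ terms in the expression.
Again, by Lemma A.2 applied to these (at most) $k_B$ columns of $B$, we get that $1/\varepsilon < \tau_1 \tau_2 \sqrt{k}$,
which establishes the lemma. □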



Remark. Note that the bound of the lemma is tight in general. For instance, suppose $A$ is an $n \times 2n$
matrix s.t. the first $n$ columns correspond to one orthonormal basis, and the next $n$ columns to
another (and the two bases are random, say). Then $\text{K-rank}_{10}(A) = n$, but for any $\tau$, we have
$\text{K-rank}_\tau(A \odot A) = 2n - 1$, since the first $n$ terms and the next $n$ terms of $A \odot A$ add up to the
same vector (as a matrix, it is the identity).
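
The linear dependence behind this tightness example can be checked exactly on a tiny instance. The sketch below is an illustration only: it takes $n = 2$, uses two fixed (not random) orthonormal bases with rational entries so that the arithmetic is exact, and the helper names are hypothetical.

```python
from fractions import Fraction as F


def self_khatri_rao_column(a):
    """Return a (x) a flattened row by row, i.e. the matrix a a^T as a vector."""
    return [x * y for x in a for y in a]


def sum_of_columns(basis):
    """Sum the columns a (x) a of A (.) A over all vectors a in the given basis."""
    columns = [self_khatri_rao_column(a) for a in basis]
    return [sum(entries) for entries in zip(*columns)]


# Two orthonormal bases of R^2 (assumed for illustration only).
first_basis = [[F(1), F(0)], [F(0), F(1)]]
second_basis = [[F(3, 5), F(4, 5)], [F(-4, 5), F(3, 5)]]

first_sum = sum_of_columns(first_basis)
second_sum = sum_of_columns(second_basis)
identity_flat = [F(1), F(0), F(0), F(1)]   # the 2 x 2 identity, flattened
```

Both sums equal the flattened identity, so the $2n = 4$ columns of $A \odot A$ satisfy a nontrivial linear relation and its Kruskal rank is at most $2n - 1$.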



Lemma A. 5. Suppose ||m( 



• V — u 



' ^'IIf < ^' '^'^^ -^min < ll^ll 1 ll^l 



U 



<L„ 



with 6 < 775 Vt^^^^^Mtt- Ifu = Oiiu' + (3iu± and v = a2v' + B2V±, where u± and v± are unit vectors 

orthogonal to u', v' respectively, then we have 

|1 — aia2\ < (5/L^in '^'^^ /^i < v(5i /52 < y^- 

Proof. We are given that u = aiu' + (3iu± and v = a2v' + /32^±- Now, since the tensored vectors 
are close 



\\u^ V — u 'v'Wp < ^ 
II (1 — aia2)u (g)v' + (3ia2U± (^ v' + (32aiu' (g) u_l + /3i/32n_L ^^_l||^ < 5^ 
L^iJl - aia2f + /3?aiiLin + I^WAn + l^fl^i < ^^ 
This implies that |1 — 0102] < (^/-^min ^^ required. 



Now, let us assume /3i > v6. This at once implies that /32 < v^- 



Also 



^2 



ai||t;'|P + /3; 



S < alLl 



Hence, 

Now, using (33), we see that /3i < v5. 



"2 > 



2-'^niax 



(33) 



□

Lemma A.6. For $\lambda > 0$, a vector $v \in \mathbb{R}^n$ with $\|v\|_1 \in [1 - \varepsilon/4, 1 + \varepsilon/4]$, and a probability vector $u \in \mathbb{R}^n$
($\|u\|_1 = \sum_i u_i = 1$), if

$$\|v - \lambda u\|_2 \le \frac{\varepsilon}{4\sqrt{n}}$$

then we have

$$1 - \varepsilon/2 \le \lambda \le 1 + \varepsilon/2 \quad \text{and} \quad \|v - u\|_2 \le \varepsilon.$$

Proof. First we have $\|v - \lambda u\|_1 \le \varepsilon/4$ by Cauchy-Schwarz. Hence, by the triangle inequality,
$|\lambda| \|u\|_1 \le 1 + \varepsilon/2$.

Since $\|u\|_1 = 1$, we get $\lambda \le 1 + \varepsilon/2$. Similarly $\lambda \ge 1 - \varepsilon/2$.

Finally, $\|v - u\|_2 \le \|v - \lambda u\|_2 + |\lambda - 1| \|u\|_2 \le \varepsilon$ (since $\lambda > 0$). Hence, the lemma follows. □
A.l Symmetric Decompositions 



Proof of Corollary\3.9[ Applying Theorem 2.7 with e' < r]{2pT\/R) ^, to obtain a permutation 



matrix H and scalar matrices Aj such that 

Vie [i]\\v -UUAjWj, <£' 

By triangle inequality, Vjj' G [i], ||^n(Aj - Aji)\\^ < 2e' 

Since H is a permutation matrix and U has columns of length at least 1/r, we get that 
Vr G [R] , j G [£] , f G ^ , | A,- (r ) - A,v (r ) | < e'r 
However, we also know that 



H' A,-/ 



jee 



<e' 



VrG[fl], {l-e')<\{Aj{i)<l + e' 



Hence, substituting ( A.l ) in the last inequality, it is easy to see that Vi G [n], \Xj{i) — 1| < 2e'r. But 
since each column of ^ is p-bounded, this shows that ||A' — An||^ < 2e'rpvi? < r/, as required. D 



B Properties of Tensors 

B.l A necessary condition for Uniqueness 

Consider a 3-tensor $T$ of rank $R$ represented by $[A\ B\ C]$ where these three matrices are of size
$n \times R$:

$$T = \sum_{r \in [R]} A_r \otimes B_r \otimes C_r.$$

We now show a necessary condition in terms of the $n^2$-dimensional vectors $A_r \otimes B_r$ from the
decomposition.

Claim B.1 (A necessary condition for uniqueness). Suppose for a subset $S \subseteq [R]$, there exist $\{\alpha_r\}_{r \in S}$
with $\|\alpha\| = 1$ and

$$\sum_{r \in S} \alpha_r A_r \otimes B_r = 0;$$

then there exist multiple rank-$R$ decompositions for $T$.

Proof. Consider any fixed non-zero vector $u$ (it can also be chosen to be not close to any of the
other vectors in $S$). Then

$$T = \sum_{r \in S} A_r \otimes B_r \otimes (C_r + \alpha_r u) + \sum_{r' \in [R] \setminus S} A_{r'} \otimes B_{r'} \otimes C_{r'}.$$

This is because $\sum_{r \in S} \alpha_r A_r \otimes B_r \otimes u = \big( \sum_{r \in S} \alpha_r A_r \otimes B_r \big) \otimes u = 0$. □

The above example showed that one necessary condition is that $A \odot B$ should be full rank
$R$ (and well-conditioned). These examples are ruled out when the Kruskal ranks of $A$ and $B$ are
such that $k_A + k_B > R$, by Lemma A.4.
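
The construction in Claim B.1 can be verified on a concrete instance. The following sketch is only an illustration with made-up integer factors: the columns of `A` and `B` are chosen so that $\sum_r \alpha_r A_r \otimes B_r = 0$ for `alpha = [2, 2, -1, -1]` (not normalized, which does not affect the argument), and `tensor_from_factors` is a hypothetical helper.

```python
def tensor_from_factors(A, B, C):
    """Build T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r] from factor matrices."""
    num_terms = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(num_terms))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


# Columns: e1, e2, e1 + e2, e1 - e2 (assumed for illustration only).
A = [[1, 0, 1, 1],
     [0, 1, 1, -1]]
B = [row[:] for row in A]
alpha = [2, 2, -1, -1]          # 2*A1(x)B1 + 2*A2(x)B2 - A3(x)B3 - A4(x)B4 = 0

C = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
u = [1, 2, 3, 4]                # any fixed non-zero vector
C_shifted = [[C[k][r] + alpha[r] * u[k] for r in range(4)] for k in range(4)]

T_original = tensor_from_factors(A, B, C)
T_shifted = tensor_from_factors(A, B, C_shifted)
```

Here `T_original` and `T_shifted` coincide although the third factors differ, which gives two distinct decompositions with the same number of terms.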

C Sampling Error Estimates for Higher Moment Tensors 

In this section, we show error estimates for $\ell$-order tensors obtained by looking at the $\ell$th moment
of various hidden variable models. In most of these models, the sample is generated from a mixture of
$R$ distributions $\{\mathcal{D}_r\}_{r \in [R]}$, with mixing probabilities $\{w_r\}_{r \in [R]}$. First the distribution $\mathcal{D}_r$ is picked
with probability $w_r$, and then the data is sampled according to $\mathcal{D}_r$, which is characteristic to the
application.

Lemma C.1 (Error estimates for Multiview mixture model). For every $\ell \in \mathbb{N}$, suppose we have
a multi-view model, with parameters $\{w_r\}_{r \in [R]}$ and $\{M^{(j)}\}_{j \in [\ell]}$, such that every entry of $x^{(j)} \in \mathbb{R}^n$
is bounded by $c_{\max}$ (or if it is multivariate gaussian). Then, for every $\varepsilon > 0$, there exists $N =$

0{c^rnax^~'^^^ log v) SUch that 

if N samples {^^(l) }j6[£i, {x{2y-''}j^!£-\, . . . , {2;(A^) }jg[£i are generated, then with high probability 



1 

N 



< e 



E x(^) ® x(2) (g) . . . x(^) 
Proof. We first bound the || • ||oo norm of the difference of tensors i.e. we show that 






(34) 



V{ii,Z2,...,v} G [nf 



E 



n 



Si) 




\[x{t) 



U) 



< e/ri 



i/2 



Consider a fixed entry (zi, i2, • • • , ie) of the tensor. 

Each sample t G [N] corresponds to an independent random variable with a bound of c^rnax- 
Hence, we have a sum of N bounded random variables. By Bernstein bounds, probability for 

^^ ' — exp {^—e^N/ {2{cmaxnY))- We have n^ events to union 



(34) to not occur exp 



2Nci 



bound over. Hence N = 0{e (cmaxn) VTTogn) suffices. Note that similar bounds hold when the 
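$x^{(j)} \in \mathbb{R}^n$ are generated from a multivariate gaussian. □

Lemma C.2 (Error estimates for Gaussians). Suppose $x$ is generated from a mixture of $R$ gaussians
with means $\{\mu_r\}_{r \in [R]}$ and covariance $\sigma^2 I$, with the means satisfying $\|\mu_r\| \le c_{\max}$.
For every $\varepsilon > 0$, $\ell \in \mathbb{N}$, there exists $N = \Omega(\mathrm{poly}(1/\varepsilon, \sigma^2, n, R))$ such that if $x^{(1)}, x^{(2)}, \dots, x^{(N)} \in \mathbb{R}^n$
were the $N$ samples, then

$$\forall \{i_1, i_2, \dots, i_\ell\} \in [n]^\ell, \quad \left| \mathbb{E}\Big[ \prod_{j \in [\ell]} x_{i_j} \Big] - \frac{1}{N} \sum_{t \in [N]} \prod_{j \in [\ell]} x^{(t)}_{i_j} \right| < \varepsilon. \qquad (35)$$

In other words,

$$\left\| \mathbb{E}\big[ x^{\otimes \ell} \big] - \frac{1}{N} \sum_{t \in [N]} \big( x^{(t)} \big)^{\otimes \ell} \right\|_\infty < \varepsilon.$$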

Proof. Fix an element $(i_1, i_2, \dots, i_\ell)$ of the $\ell$-order tensor. Each point $t \in [N]$ corresponds to an
i.i.d. random variable $Z^t = x^{(t)}_{i_1} x^{(t)}_{i_2} \cdots x^{(t)}_{i_\ell}$. We are interested in the deviation of the sum
$S = \frac{1}{N} \sum_{t \in [N]} Z^t$. Each of the i.i.d. random variables has value $Z = x_{i_1} x_{i_2} \cdots x_{i_\ell}$. Since the gaussians are spherical
(axis-aligned suffices) and each mean is bounded by $c_{\max}$, $|Z| < (c_{\max} + t\sigma)^\ell$ except with probability
$O(\exp(-t^2/2))$. Hence, by using standard sub-gaussian tail inequalities, we get



Pr|S'-E[z]| >e<exp 



e^N 



(M + cr^logn) 



Hence, to union bound over all n events N = O (e ^(^lognM)^) suffices. 



□